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Abstract 

Most of applied statistics involves regression analysis of data. In practice, it is im¬ 
portant to specify a regression model that has minimal assumptions which are not vio¬ 
lated by data, to ensure that statistical inferences from the model are informative and 
not misleading. This paper presents a stand-alone and menu-driven software package, 
Bayesian Regression: Nonparametric and Parametric Models, constructed from MAT- 
LAB Compiler. Currently, this package gives the user a choice from 83 Bayesian models 
for data analysis. They include 47 Bayesian nonparametric (BNP) infinite-mixture re¬ 
gression models; 5 BNP infinite-mixture models for density estimation; and 31 normal 
random effects models (HLMs), including normal linear models. Each of the 78 regres¬ 
sion models handles either a continuous, binary, or ordinal dependent variable, and 
can handle multi-level (grouped) data. All 83 Bayesian models can handle the analysis 
of weighted observations (e.g., for meta-analysis), and the analysis of left-censored, 
right-censored, and/or interval-censored data. Each BNP infinite-mixture model has a 
mixture distribution assigned one of various BNP prior distributions, including priors 
defined by either the Dirichlet process, Pitman-Yor process (including the normalized 
stable process), beta (two-parameter) process, normalized inverse-Gaussian process, 
geometric weights prior, dependent Dirichlet process, or the dependent infinite-probits 
prior. The software user can mouse-click to select a Bayesian model and perform 
data analysis via Markov chain Monte Carlo (MCMC) sampling. After the sampling 
completes, the software automatically opens text output that reports MCMC-based 
estimates of the model’s posterior distribution and model predictive fit to the data. 
Additional text and/or graphical output can be generated by mouse-clicking other 
menu options. This includes output of MCMC convergence analyses, and estimates of 
the model’s posterior predictive distribution, for selected functionals and values of co¬ 
variates. The software is illustrated through the BNP regression analysis of real data. 
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1 Introduction 


Regression modeling is ubiquitous in empirical areas of scientific research, because most 
research questions can be asked in terms of how a dependent variable changes as a function 
of one or more covariates (predictors). Applications of regression modeling involve either 
prediction analysis (e.g., Dension et ah, 2002; Hastie et ah 2009), categorical data analysis 
(e.g., Agresti, 2002), causal analysis (e.g., Imbens, 2004; Imbens & Lemieux, 2008; Stuart, 
2010), meta-analysis (e.g.. Cooper, et ah 2009), survival analysis of censored data (e.g., Klein 
& Moeschberger, 2010), spatial data analysis (e.g., Gelfand, et ah, 2010), time-series analysis 
(e.g., Prado & West, 2010), item response theory (IRT) analysis (e.g., van der Linden, 2015), 
and/or other types of regression analyses. 

These applications often involve either the normal random-effects (multi-level) linear 
regression model (e.g., hierarchical linear model; HLM). This general model assumes that 
the mean of the dependent variable changes linearly as a function of each covariate; the 
distribution of the regression errors follows a zero-mean symmetric continuous (e.g., normal) 
distribution; and the random regression coefficients are normally distributed over pre-dehned 
groups, according to a normal (random-effects) mixture distribution. Under the ordinary 
linear model, this mixture distribution has variance zero. For a discrete dependent variable, 
all of the previous assumptions apply for the underlying (continuous-valued) latent dependent 
variable. For example, a logit model (probit model, resp.) for a binary-valued (0 or 1) 
dependent variable implies a linear model for the underlying latent dependent variable, with 
error distribution assumed to follow a logistic distribution with mean 0 and scale 1 (normal 
distribution with mean 0 and variance 1, resp.) (e.g., Dension et ah, 2002). 

If data violate any of these linear model assumptions, then the estimates of regression 
coefhcient parameters can be misleading. As a result, much research has devoted to the 
development of more flexible, Bayesian nonparametric (BNP) regression models. Each of 
these models can provide a more robust, reliable, and rich approach to statistical inference, 
especially in common settings where the normal linear model assumptions are violated. 
Excellent reviews of BNP models are given elsewhere (e.g.. Walker, et al., 1999; Ghosh & 
Ramamoorthi, 2003; Muller & Quintana, 2004; Hjort, et al. 2010; Mitra & Muller, 2015). 

A BNP model is a highly-flexible model for data, dehned by an inhnite (or a very large 
hnite) number of parameters, with parameter space assigned a prior distribution with large 
supports (Muller & Quintana, 2004). Typical BNP models have an inhnite-dimensional, 
functional parameter, such as a distribution function. According to Bayes’ theorem, a set of 
data updates the prior to a posterior distribution, which conveys the plausible values of the 
model parameters given the data and the chosen prior. Typically in practice, Markov chain 
Monte Garlo (MGMG) sampling methods (e.g.. Brooks, et al. 2011) are used to estimate the 
posterior distribution (and chosen functionals) of the model parameters. 

Among the many BNP models that are available, the most popular models in practice are 
inhnite-mixture models, each having mixture distribution assigned a (BNP) prior distribution 
on the entire space of probability measures (distribution functions). BNP inhnite-mixture 
models are popular in practice, because they can provide a hexible and robust regression 
analysis of data, and provide posterior-based clustering of subjects into distinct homogeneous 
groups, where each subject cluster group is dehned by a common value of the (mixed) random 
model parameter(s). A standard BNP model is dehned by the Dirichlet process (inhnite-) 
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mixture (DPM) model (Lo, 1984), with mixture distributiou assigued a Dirichlet process 
(DP) (Fergusou, 1973) prior distributiou ou the space of probability measures. Also, ofteu 
iu practice, a BNP model is specihed as au iuhuite-mixture of uormal distributious. This is 
motivated by the well-kuowu fact that auy smooth probability deusity (distributiou) of auy 
shape aud locatiou cau be approximated arbitrarily-well by a mixture of uormal distributious, 
provided that the mixture has a suitable uumber of mixture compoueuts, mixture weights, 
aud compoueut parameters (meau aud variauce). 

A flexible BNP iuhuite-mixture model ueed uot be a DPM model, but may iustead have 
a mixture distributiou that is assigued auother BNP prior, dehued either by a more geueral 
stick-breakiug process (Ishwarau & James, 2001; Pitmau, 1996), such as the Pitmau-Yor 
(or Poissou-Dirichlet) process (Pitmau, 1996; Pitmau & Yor, 1997), the uormalized stable 
process (Kiugmau, 1975), the beta two-parameter process (Ishwarau & Zarapour, 2000); or a 
process with more restrictive, geometric mixture weights (Fueutes-Garcia, et ah 2009, 2010); 
or dehued by the uormalized iuverse-Gaussiau process (Lijoi, et ah 2005), a geueral type of 
uormalized raudom measure (Regazziui et ah, 2003). 

A more geueral BNP iuhuite-mixture model cau be coustructed by assiguiug its mix¬ 
ture distributiou a covariate-depeudeut BNP prior. Such a BNP mixture model allows the 
eutire depeudeut variable distributiou to chauge hexibly as a fuuctiou of covariate(s). The 
Depeudeut Dirichlet process (DDP; MacEacheru, 1999, 2000, 2001) is a semiual covariate- 
depeudeut BNP prior. Ou the other haud, the iuhuite-probits prior is dehued by a depeudeut 
uormalized raudom measure, coustructed by au iuhuite uumber of covariate-depeudeut mix¬ 
ture weights, with weights specihed by au ordiual probits regressiou with prior distributiou 
assigued to the regressiou coefhcieut aud error variauce parameters (Karabatsos & Walker, 
2012a). 

The applicability of BNP models, for data aualysis, depeuds ou the availability of user- 
frieudly software. This is because BNP models typically admit complex represeutatious, 
which may uot be immediately accessible to uou-experts or begiuuers iu BNP. Gurreutly 
there are two uice commaud-driveu R software packages for BNP mixture modeliug. The 
DPpackage (Jara, et ah, 2011) of R (the R Developmeut Gore Team, 2015) iucludes mauy 
BNP models, mostly DPM models, that provide either hexible regressiou or deusity estima- 
tiou for data aualysis. The package also provides BNP models haviug parameters assigued a 
hexible mixture of huite Polya Trees BNP prior (Hausou, 2006). The bspmma R package 
(Burr, 2012) provides DPM uormal-mixture models for meta-aualysis. 

The existiug packages for BNP modeliug, while impressive, still suggest room for im- 
provemeuts, as summarized by the followiug poiuts. 

1. While the existiug BNP packages provide mauy DPM models, they do uot provide a 
BNP iuhuite-mixture model with mixture distributiou assigued auy oue of the other 
importaut BNP priors meutioued earlier. Priors iuclude those dehued by the Pitmau- 
Yor, uormalized stable, beta, uormalized iuverse-Gaussiau process; or dehued by a 
geometric weights or iuhuite-probits prior. As au exceptiou, the DPpackage provides 
a Pitmau-Yor process mixture of regressious model for iuterval-ceusored data (Jara, et 
al. 2010). 

2. The bspmma R package (Burr, 2012), for meta-aualysis, is limited to DPM models 
that do uot iucorporate covariate depeudeuce (Burr & Doss, 2005). 
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3. The DPpackage handles interval-censored data, bnt does not handle left- or right- 
censored data. 


4. While both BNP packages use MCMC sampling algorithms to estimate the posterior 
distribution of the user-chosen model, each package does not provide a MCMC con¬ 
vergence analysis (e.g., Flegal & Jones, 2011). A BNP package that provides its own 
menu options for MCMC convergence analysis would be, for the user, faster and more 
convenient, and would not require learning a new package (e.g., CODA R package; 
Plummer et ah, 2006) to conduct MCMC convergence analyses. 

5. Both BNP packages do not provide many options to investigate how the posterior 
predictive distribution (and chosen functionals) of the dependent variable, varies as a 
function of one or more covariates. 

6. Generally speaking, command-driven software can be unfriendly, confusing, and time- 
consuming to beginners and to experts. 

In this paper, we introduce a stand-alone and user-friendly software package for BNP 
modeling, which the author constructed using MATLAB Compiler (Natick, MA). This pack¬ 
age, named: Bayesian Regression: Nonparametric and Parametric Models, provides 
BNP data analysis in a fully menu-driven software environment that resembles SPSS (I.B.M., 
2015). 

The software allows the user to mouse-click menu options: 

1. To inspect, describe, and explore the variables of the data set, via basic descriptive 
statistics (e.g., means, standard deviations, quantiles/percentiles) and graphs (e.g., 
scatter plots, box plots, normal Q-Q plots, kernel density plots, etc.); 

2. To pre-process the data of the dependent variable and/or the covariate(s) before in¬ 
cluding the variable(s) into the BNP regression model for data analysis. Examples 
of data pre-processing includes constructing new dummy indicator (0 or 1) variables 
and/or two-way interaction variables from the covariates (variables), along with other 
options to transform variables; and performing a nearest-neighbor hot-deck imputation 
(Andridge & Little, 2010) of missing data values in the variables (e.g., covariate(s)). 

3. To use list and input dialogs to select, in the following order: the Bayesian model 
for data analysis; the dependent variable; covariate(s) (if a regression model was se¬ 
lected); parameters of the prior distribution of the model; the (level-2 and possibly 
level-3) grouping variables (for a multilevel model, if selected); the observation weights 
variable (if necessary; e.g., to set up a meta-analysis); and the variables describing 
the nature of the censored dependent variable observations (if necessary; e.g., to set 
up a survival analysis). The observations can either be left-censored, right-censored, 
interval-censored, or uncensored. Also, if so desired, the user can easily use point-and- 
click to quickly highlight and select a large list of covariates for the model, whereas 
command-driven software requires the user to carefully type (or copy and paste) and 
correctly-verify the long list of the covariates. 
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After the user make these selections, the Bayesian Regression software immediately presents 
a graphic of the user-chosen Bayesian model in the middle of the computer screen, along with 
all of the variables that were selected for this model (e.g., dependent variables, covariate(s); 
see 7^3 above). The explicit presentation of the model is important because BNP models 
typically admit complex representations. In contrast, the command-driven packages do not 
provide immediate on-screen presentations of the BNP model selected by the user. 

Then the software user can click a button to run the MCMC sampling algorithm for the 
menu-selected Bayesian model. The user clicks this button after entering a number of MCMC 
sampling iterations. Immediately after all the MCMC sampling iterations have completed, 
the software automatically opens a text output hie that summarizes the basic results of the 
data analysis (derived from the generated MCMC samples). Results include point-estimates 
of the (marginal) posterior distributions of the model’s parameters, and summaries of the 
model’s predictive ht to the data. Then, the user can click other menu options to produce 
graphical output of the results. They include density plots, box plots, scatter plots, trace 
plots, and various plots of (marginal) posterior distributions of model parameters and ht 
statistics. For each available BNP inhnite-mixture model, the software implements standard 
slice sampling MCMC methods (Kalli, et ah 2011) that are suitable for making inferences 
of the posterior distribution (and chosen functionals) of model parameters. 

Next, after a few mouse-clicks of appropriate menu options, the user can perform a 
detailed MCMC convergence analysis. This analysis evaluates whether a sufhciently-large 
number of MCMC samples (sampling iterations of the MCMC algorithm) has been generated, 
in order to warrant the conclusion that these samples have converged to samples from the 
posterior distribution (and chosen functionals) of the model parameters. More details about 
how to use the software to perform MCMC convergence analysis is provided in Sections 3.1 
and 5.2. 

The software also provides menu options to investigate how the posterior predictive dis¬ 
tribution (and functionals) of the dependent variable changes as a function of covariates. 
Functionals of the posterior predictive distribution include: the mean, median, and quan¬ 
tiles to provide a quantile regression analysis; the variance functional to prove a variance 
regression analysis; the probability density function (p.d.f.) and the cumulative distribution 
function (c.d.f.) to provide a density regression analysis; and the survival function, hazard 
function, and the cumulative hazard function, for survival analysis. The software also pro¬ 
vides posterior predictive inferences for BNP inhnite-mixture models that do not incorporate 
covariates and only focus on density estimation. 

Currently, the Bayesian Regression software provides the user a choice from 83 Bayesian 
models for data analysis. Models include 47 BNP inhnite-mixture regression models, 31 nor¬ 
mal linear models for comparative purposes, and 5 BNP inhnite normal mixture models for 
density estimation. Most of the inhnite-mixture models are dehned by normal mixtures. 

The 47 BNP inhnite-mixture regression models can each handle a dependent variable 
that is either continuous-valued, binary-valued (0 or 1), or ordinal valued (c = 0,1,..., m), 
using either a probit or logit version of this model for a discrete dependent variable; with 
mixture distribution assigned a prior distribution dehned either by the Dirichlet process. 
Pitman-Yor process (including the normalized stable process), beta (2-parameter) process, 
geometric weights prior, normalized inverse-Gaussian process, or an inhnite-probits regres¬ 
sion prior; and with mixing done on either the intercept parameter, or on the intercept and 
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slope coefficient parameters, and possibly on the error variance parameter. Specifically, the 
regression models with mixture distribution assigned a Dirichlet process prior are equivalent 
to ANOVA/linear DDP models, defined by an infinite-mixture of normal distributions, with 
a covariate-dependent mixture distribution defined by independent weights (De lorio, et ah 
2004; Muller et al., 2005). Similarly, the models with mixture distribution, instead, assigned 
a different BNP prior distribution (process) mentioned above, implies a covariate-dependent 
version of that process. See Section 3.1 for more details. Also, some of the infinite-mixture 
regression models, with covariate-dependent mixture distribution assigned a infinite-probits 
prior, have spike-and-slab priors assigned to the coefficients of this BNP prior, based on 
stochastic search variable selection (SSVS) (George & McCulloch, 1993, 1997). In addition, 
the 5 BNP infinite normal mixture models, for density estimation, include those with mix¬ 
ture distribution assigned a BNP prior distribution that is defined by either one of the 5 
BNP process mentioned above (excluding infinite-probits). 

Finally, the 31 Bayesian linear models of the Bayesian regression software include ordinary 
linear models, 2-level, and 3-level normal random-effects (or HLM) models, for a continuous 
dependent variable; probit and logit versions of these linear models for either a binary (0 or 
1) or ordinal (c = 0,1,... ,m) dependent variable; and with mixture distribution specified 
for the intercept parameter, or for the intercept and slope coefficient parameters. 

The outline for the rest of the paper is as follows. Section 2 reviews the Bayesian in¬ 
ference framework. Appendix A reviews the basic probability theory notation and concepts 
that we use. In Section 3, we define two key BNP infinite-mixture regression models, each 
with mixture distribution assigned a BNP prior distribution on the space of probability mea¬ 
sures. The other 50 BNP infinite-mixture models of the Bayesian regression software are 
extensions of these two key models, and in that section we give an overview of the various 
BNP priors mentioned earlier. In that section we also describe the Bayesian normal linear 
model, and a Bayesian normal random-effects linear model (HLM). Section 4 gives step-by- 
step software instructions on how to perform data analysis using a menu-chosen, Bayesian 
model. Section 5 illustrates the Bayesian regression software through the analysis of a real 
data set, using each of the two key BNP models, and a Bayesian linear model. Appendix B 
provides a list of exercises that the software user can work through in order to practice BNP 
modeling on several example data sets, available from the software. These data-analysis 
exercises address applied problems in prediction analysis, categorical data analysis, causal 
analysis, meta-analysis, survival analysis of censored data, spatial data analysis, time-series 
analysis, and item response theory analysis. Section 6 ends with conclusions. 

2 Overview of Bayesian Inference 

In a given research setting where it is of interest to apply a regression data analysis, 
a sample data set is of the form = {(^/i, Xj)}"^^. Here, n is the sample size of the 
observations, respectively indexed by i = where Ui is the ith observation of the 

dependent variable Yi, corresponding to an observed vector of p observed covariates^ x* = 
(l,Xji, • • • yXipY- A constant (1) term is included in x for future notational convenience. 

A regression model assumes a specific form for the probability density (or p.m.f.) function 

^For each Bayesian model for density estimation, from the software, we assume x = 1. 
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/(?/1 x; conditionally on covariates x and model parameters denoted by a vector, G 
where = {<^} is the parameter space. For any given model parameter value C ^ 
density /(|/i | xp C) is the likelihood of yi given Xj, and L(F>„; C) = IY^=i fiVil^uC) is the 
likelihood of the full data set under the model. A Bayesian regression model is completed 
by the specification of a prior distribution (c.d.f.) n(^) over the parameter space and 
7r(^) gives the corresponding probability density of a given parameter G 

According to Bayes’ theorem, after observing the data Vn = {(l/i, the plausible 

values of the model parameter is given by the posterior distribution. This distribution 
dehnes the posterior probability density of a given parameter E by: 

nr=./teix.;Odn(c) 

/ nr.i /(k I x.; C)dn(c) 

J 

Conditionally on a chosen value of the covariates x = (l,a:i,... yXpY, the posterior predic¬ 
tive density of a future observation yn+i, and the corresponding posterior predictive c.d.f. 
{F{y I x)), mean (expectation, E), variance (V), median, uth. quantile {Q{u \ x), for some cho¬ 
sen u G [0,1], with Q(.5 I x) the conditional median), survival function (S'), hazard function 
(Sf), and cumulative hazard function (A), is given respectively by: 


fn{y 1 x) 

= |/(9|xiC)dn(c|c„), 

(2a) 

Fn{y 1 x) 

= f /( 9 |x;C)dn(C|r>„), 

(2b) 


Y<y 


E.(r|x) 

= j y<^Fn{y\^), 

(2c) 

v.(y|x) 

= jiy~ 1 1 x). 

(2d) 

Qniu 1 x) 


(2e) 

Sniy 1 x) 

1 

3 

(2f) 

Hn{y 1 x) 

= /n(2/|x)/{l-F„(2/|x)}, 

(2g) 

^niy 1 x) 

= -iog{i-T;(i/|x)}. 

(2h) 


Depending on the choice of posterior predictive functional from (2), a Bayesian regression 
analysis can provide inferences in terms of how the mean (2c), variance (2d), quantile (2e) 
(for a given choice u G [0,1]), p.d.f. (2a), c.d.f. (2b), survival function (2f), hazard function 
(2g), or cumulative hazard function (2h), of the dependent variable Y, varies as a function of 
the covariates x. While the mean functional E„(y | x) is conventional for applied regression, 
the choice of functional V„(y|x) pertains to variance regression; the choice of function 
Qn{u I x) pertains to quantile regression; the choice of p.d.f. fn{y \ x) or c.d.f. Fn{y \ x) 
pertains to Bayesian density (distribution) regression; and the choice of survival Sn{y \ x) or 
a hazard function {Hniy \ x) or A„(|/1 x)) pertains to survival analysis. 

In practice, the predictions of the dependent variable Y (for a chosen functional from 
(2)), can be easily viewed (in a graph or table) as a function of a subset of only one or 
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two covariates. Therefore, for practice we need to consider predictive methods that involve 
snch a small subset of covariates. To this end, let be a focal subset of the covariates 
(xi,..., Xp), with X 5 also including the constant ( 1 ) term. Also, let x^ be the non-focal, 
complement set of q (unselected) covariates. Then X 5 fl xc 7 ^ 0 and x = X 5 U x^. 

It is possible to study how the predictions of a dependent variable Y varies as a function 
of the focal covariates x^, using one of four automatic methods. The first two methods 
are conventional. They include the grand-mean centering method, which assumes that the 
non-focal covariates x^ is defined by the mean in the data with x^ := “ 
and the zero-centering method, which assumes that the non-focal covariates are given by 
xc := Og where 0 ^ is a vector of q zeros. Both methods coincide if the observed covariates 
{xj = (1, Xii, ..., in the data Vn have average (1,0,..., 0 )'''. This is the case if the 

covariate data {xifc}"=i have already been centered to have mean zero, for /c = 1 ,... ,p. 

The partial dependence method (Friedman, 2001, Section 8.2) is the third method for 
studying how the predictions of a dependent variable Y varies as a function of the focal 
covariates x^. In this method, the predictions of Y, conditionally on each value of the 
focal covariates X 5 , are averaged over data (P„) observations (and effects) of the 

non-focal covariates x^. Specifically, in terms of the posterior predictive functionals ( 2 ), the 
averaged prediction of Y, conditionally on a value of the covariates x^, is given respectively 
by: 


fn{y 

X5) 

Fn{y 

xs) 

E„(r 

Xs) 

V.(F 

xs) 

Qniy^ 

xs) 

Sn{y 

Xs) 

Hn{y 

Xs) 

K{y 

Xs) 


^Er=i/n(2/|x5,xci), 

^Er=i^ri(|/|x5,xci), 

^Er=lEn(b"|xs,Xci), 

^Er=l®^n(^|xS,Xci), 

^Er=l^n^(w|xs,Xci), 

^Er=l{l -^n(|/|x5,Xci)}, 

^Er=l^n(2/|x5,Xci), 

I Er=l - -Fn{y\^S, Xci)}. 


(3a) 

(3b) 

(3c) 

(3d) 

(3e) 

(3f) 

(3g) 

(3h) 


The equations above give, respectively, the (partial dependence) posterior predictive den¬ 
sity, c.d.f., mean, variance, quantile (at u G [0,1]), survival function, hazard function, and 
cumulative hazard function, of Y, conditionally on a value X 5 of the focal covariates. As a 
side note pertaining to causal analysis, suppose that the focal covariates include a covari¬ 
ate, denoted T, along with a constant (1) term, so that X 5 = (l,t). Also suppose that 
the covariate T is a binary-valued (0,1) indicator of treatment receipt, versus non-treatment 
receipt. Then the estimate of a chosen (partial-dependence) posterior predictive functional 
of Y under treatment (T = 1) from (3), minus that posterior predictive functional under 
control (T = 0), provides an estimate of the causal average treatment effect (CATE). This is 
true provided that the assumptions of unconfoundedness and overlap hold (Imbens, 2004). 

The partial-dependence method can be computationally-demanding, as a function of 
sample size (n), the dimensionality of X 5 , the number of X 5 values considered when inves¬ 
tigating how Y varies as a function of X 5 , and the number of MCMC sampling iterations 
performed for the estimation of the posterior distribution (density ( 1 )) of the model pa¬ 
rameters. In contrast, the clustered partial dependence method, the fourth method, is less 



computationally-demanding. This method is based on forming J’C-means cluster centroids, 
of the data observations {'x.ei}'i=i of fhe non-focal covariates x^, with K = 
clusters as a rule-of-thumb. Then the posterior predictions of y, conditionally on chosen 
value of the covariate subset x^, is given by any one of the chosen posterior functionals (3) 
of interest, after replacing ^ with ^ xa with xct- 

The predictive fit of a Bayesian regression model, to a set of data, Vn = {(?/*,Xj)}”^^, 
can be assessed on the basis of the posterior predictive expectation (2c) and variance (2d). 
First, the standardized residual fit statistics of the model are defined by: 


yi - E^(y I Xj) 

v/Vn(y I X,) ’ 


i = 1,... ,n. 


(4) 


An observation yi can be judged as an outlier under the model, when its absolute standardized 
residual | r* | exceeds 2 or 3. The proportion of variance explained in the dependent variable 
y, by a Bayesian model, is measured by the R-squared statistic: 


Also, suppose that it is of interest to compare M regression models, in terms of predictive 
ht to the given data set Vn- Models are indexed by m = 1,..., M, respectively. For each 
model m, a global measure of predictive fit is given by the mean-squared predictive error 
criterion: 

= YTi=l{yi - ^niy I Xi,m)}^ -h I (6) 

(Laud & Ibrahim, 1995; Gelfand & Ghosh, 1998). The first term in (6) measures model 
goodness-of-fit to the data and the second term is a model complexity penalty. Among 
a set of M regression models compared, the model with the best predictive ht for the data 
Vn is identihed as the one that has the smallest value of D{m). 


2.1 MCMC Methods 

In practice, a typical Bayesian model does not admit a closed-form solution for its pos¬ 
terior distribution (density function of the form (1)). However, the posterior distribution, 
along with any function of the posterior distribution of interest, can be estimated through 
the use of Monte Garlo methods. In practice, they usually involve Markov Ghain Monte 
Garlo (MGMG) methods (e.g.. Brooks et ah, 2011). Such a method aims to construct a 
discrete-time Harris ergodic Markov chain with stationary (posterior) distribution 

n(CI’Pn), and ergodicity is ensured by a proper (integrable) prior density function 7r(^) 
(Robert & Gasella, 2004, Section 10.4.3). A realization from the Markov chain can be 
generated by hrst specifying partitions (blocks) Cb (^ = 1) • • •) -S) of the model’s parameter 
C, and then simulating a sample from each of the full conditional posterior distributions 
n(Cfe I "^n, Cc, 7 ^ 5), in turn for 6 = l,...,i?. Then, as S' —)■ oo, the Markov (MGMG) 
chain converges to samples from the posterior distribution n(((^ | Vn)- Therefore, in 

practice, the goal is to construct a MGMG chain (samples) for a sufficiently-large 

hnite S. 
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MCMC convergence analyses can be performed in order to check whether a sufficiently- 
large nnmber (S) of sampling iterations has been run, to warrant the conclusion that the 
resulting samples have converged (practically) to samples from the model’s pos¬ 

terior distribution. Such an analysis may focus only on the model parameters of interest for 
data analysis, if so desired. MCMC convergence can be investigated in two steps (Geyer, 
2011). One step is to inspect, for each of these model parameters, the univariate trace plot 
of parameter samples over the MCMC sampling iterations. This is done to evaluate MCMC 
mixing, i.e., the degree to which MCMC parameter samples explores the parameter’s support 
in the model’s posterior distribution. Good mixing is suggested by a univariate trace plot 
that appears stable and ’’hairy” over MCMC iterations^. The other step is to conduct, for 
each model parameter of interest, a batch means (or subsampling) analysis of the MCMC 
samples, in order to calculate 95% Monte Carlo Conhdence Intervals (95% MCCIs) of poste¬ 
rior point-estimates of interest (such as marginal posterior means, variances, quantiles, etc., 
of the parameter) (Flegal & Jones, 2011). For a given (marginal) posterior point-estimate of 
a parameter, the 95% MCCI half-width size reflects the imprecision of the estimate due to 
Monte Carlo sampling error. The half-width becomes smaller as number of MCMC sampling 
iterations grows. In all, MCMC convergence is conhrmed by adequate MCMC mixing and 
practically-small 95% MCCIs half-widths (e.g., .10 or .01) for the (marginal) posterior point- 
estimates of parameters (and chosen functionals) of interest. If adequate convergence cannot 
be conhrmed after a MCMC sampling run, then additional MCMC sampling iterations can 
be run until convergence is obtained for the (updated) total set of MCMC samples. 

For each BNP inhnite-mixture model, the Bayesian Regression software estimates the 
posterior distribution (and functionals) of the model on the basis of a general slice-sampling 
MCMC method, which can handle the inhnite-dimensional model parameters (Kalli, et al., 
2011). This slice-sampling method does so by introducing latent variables into the likelihood 
function of the inhnite-mixture model, such that, conditionally on these variables, the model 
is hnite-dimensional and hence tractable by a computer. Marginalizing over the distribution 
of these latent variables recovers the original likelihood function of the inhnite-mixture model. 

We now describe the MCMC sampling methods that the software uses to sample from the 
full conditional posterior distributions of the parameters, for each model that the software 
provides. For each DPM model, the full conditional posterior distribution of the unknown 
precision parameter (a) is sampled from a beta mixture of two gamma distributions (Es¬ 
cobar & West, 1995). For each BNP inhnite-mixture model based on a DP, Pitman-Yor 
process (including the the normalized stable process), or beta process prior, the full condi¬ 
tional posterior distribution of the mixture weight parameters are sampled from appropriate 
beta distributions (Kalli, et al., 2011). Also, for the parameters of each of the 31 lin¬ 
ear models, and for the linear parameters of each of the BNP inhnite-mixture models, the 
software implements (direct) MCMC Gibbs sampling of standard full conditional posterior 
distributions, derived from the standard theories of the Bayesian normal linear, probit, and 
logit models, as appropriate (Evans, 1965; Findley & Smith, 1972; Gilks et al., 1993; Al¬ 
bert & Chib, 1993; Bernardo & Smith, 1994; Denison et al., 2002; Cepeda & Gamerman, 
2001; O’Hagan & Forster, 2004; Holmes & Held, 2006; George & McCulloch, 1997; e.g., 

^The CUSUM statistic, which ranges between 0 and 1, is a measure of the ’’hairiness” of a univariate 
trace plot of a model parameter (Brooks, 1998). A CUSUM value of .5 indicates optimal MCMC mixing. 
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see Karabatsos & Walker, 2012a,b). When the full conditional posterior distribution of the 
model parameter(s) is non-standard, the software implements a rejection sampling algo¬ 
rithm. Specifically, it implements an adaptive random-walk Metropolis-Hastings (ARWMH) 
algorithm (Atchade & Rosenthal, 2005) with normal proposal distribution, to sample from 
the full conditional posterior distribution(s) of the mixture weight parameter of a BNP geo¬ 
metric weights infinite-mixture model; the mixture weight parameter of a BNP normalized 
inverse-Gaussian process mixture model, using the equivalent stick-breaking representation 
of this process (Favaro, et ah 2012). Also, for BNP inhnite-mixture models and normal 
random-effects models that assign a uniform prior distribution to the variance parameter for 
random intercepts (or means), the software implements the slice sampling (rejection) algo¬ 
rithm with stepping-out procedure (Neal, 2003), in order to sample from the full conditional 
posterior distribution of this parameter. Finally, for computational speed considerations, 
we use the ARWMH algorithm instead of Gibbs sampling, in order to sample from the full 
conditional posterior distributions for the random coefficient parameters (the intercepts uoh'-, 
and possibly the Ukh, /c = 0 , 1 ,... ,p, as appropriate, for groups h = 1 ,..., iP) in a normal 
random-effects (or random intercepts) HLM; and for the random coefficients {(3^) or ran¬ 
dom intercept parameters (/Soj) iii a BNP infinite-mixture regression model, as appropriate 
(Karabatsos & Walker, 2012a,b). 

The given data set {T>n) may consist of censored dependent variable observations (either 
left-, right-, and/or interval-censored). If the software user indicates the censored dependent 
variable observations (see Section 4.2, Step 6 ), then the software adds a Gibbs sampling step 
to the MGMG algorithm, that draws from the full-conditional posterior predictive distribu¬ 
tions (density function (2a)) to provide multiple MGMG-based imputations of these missing 
censored observations (Gelfand, et ah 1992; Karabatsos & Walker, 2012a). 

Finally, the software implements Rao-Blackwellization (RB) methods (Gelfand & Mukhopad- 
hyay, 1995) to compute estimates of the linear posterior predictive functionals from (2) and 
(3). In contrast, the quantile functional Qn{u \ x) is estimated from order statistics of MGMG 
samples from the posterior predictive distribution of Y given x. The 95% posterior credible 
interval of the quantile functional Q{u\x.) can be viewed in a PP-plot (Wilk & Gnanade- 
sikan, 1968) of the 95% posterior interval of the c.d.f. F{u \ x), using available software menu 
options. The hazard functional Hn{y \ x) and the cumulative hazard functional An{y \ x) are 
derived from RB estimates of the linear functionals fn{y \ x) and Fn{y \ x). The same is true 
for the partial-dependence functionals Qniu \ x^), Hniy \ x^), and \ x^). 


3 Key BNP Regression Models 


A BNP infinite-mixture regression model has the general form: 


/gx(i/|x;C) 


/ 


fiy I x; -0, 0(x))dGx(0) = ^ fiy 

1=1 


x;'0,^j(x))wj(x). 


(7) 


given a covariate (x) dependent, discrete mixing distribution Gx; kernel (component) densi¬ 
ties 

/(y I x;-0, 0j(x)) with component indices j = 1 , 2 ,..., respectively; with fixed parameters 
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i/?; and with component parameters Oj(x.) having sample space 0; and given mixing weights 
that sum to 1 at every x G df, with X the covariate space. 

In the inhnite-mixture model (7), the covariate-dependent mixing distribution is a random 
probability measure that has the general form^, 

CX) 

Gx(i?) = 5^a;,(x)50^,(,)(i?), Vi? G S(0), (8) 

i=i 

and is therefore an example of a species sampling model (Pitman, 1995). 

The mixture model (7) is completed by the specification of a prior distribution n(^) on 
the space = {<^} of the inhnite-dimensional model parameter, given by: 

C = (tA, (0i(x),u;j(x))“i,x G X). (9) 

The BNP inhnite-mixture regression model (7)-(8), completed by the specihcation of a prior 
distribution n(^), is very general and encompasses, as special cases: hxed- and random- 
effects linear and generalized linear models (McCullagh & Nelder, 1989; Verbeke & Molen- 
berghs, 2000; Molenberghs & Verbeke, 2005), hnite-mixture latent-class and hierarchical 
mixtures-of-experts regression models (McLachlan & Peel, 2000; Jordan & Jacobs, 1994), 
and inhnite-mixtures of Gaussian process regressions (Rasmussen & Ghahramani, 2002). 

In the general BNP model (7)-(8), assigned prior n(^), the kernel densities f{y \ x; -0, 0j(x)) 
may be specihed as covariate independent, with: f{y \ x; ij), 0j(x)) := f{y \ ij), Oj)] and may 
not contain hxed parameters ij), in which case '?/’ is null. Also for the model, covariate depen¬ 
dence is not necessarily specihed for the mixing distribution, so that '■= G. No covariate 
dependence is specihed for the mixing distribution if and only if both the component param¬ 
eters and the mixture weights are covariate independent, with Oj{x) := Oj and cuj(x) '■= ooj- 
The mixing distribution G^ is covariate dependent if the component parameters 0j(x) or 
the mixture weights cjj(x) are specihed as covariate dependent. 

Under the assumption of no covariate dependence in the mixing distribution, with G^ ■ = 
G, the Dirichlet process (Ferguson, 1973) provides a standard and classical choice of BNP 
prior distribution on the space of probability measures Qe = {G}e on the sample space 
0. The Dirichlet process is denoted W{a,GQ) with precision parameter a and baseline 
distribution (measure) Gq. We denote G ~ VV{a, Gq) when the random probability measure 
G is assigned a VV{a, Gq) prior distribution on Qq. Under the VV{a, Fq) prior, the (prior) 
mean and variance of G are given respectively by (Ferguson, 1973): 

E[G(i?) I a. Go] = -^ = Go(i?), (10a) 

a 

V[G(R)|a,Go] = - Go{B)] ^ VR G S(0). (10b) 

a -f 1 

For the VV{a, Go) prior, (10a) shows that the baseline distribution Go represents the prior 
mean (expectation) of G, and the prior variance of G is inversely proportional to the precision 
parameter a, as shown in (10b). The variance of G is increased (decreased, resp.) as a 

^Throughout, Sel-) denotes a degenerate probability measure (distribution) with point mass at 0, such 
that 6* ~ Se and Pr(0* = 0) = 1. Also, Se{B) = 1 ii 6 G B and 5e{B) = 0 if 0 ^ i3, for \/B € B{&). 
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becomes smaller (larger, resp.). In practice, a standard choice of baseline distribution Go(') 
is provided by the normal N(p,,cr^) distribution. The VV{a,Go) can also be characterized 
in terms of a Dirichlet (Di) distribution. That is, if G ~ W{a,GQ), then: 

(G(fii),..., G{Bk)) I a. Go ~ Di(aGo(5i),..., aGo{Bk)), (11) 

for every choice of fc > 1 (exhaustive) partitions Bi,... ,Bk of the sample space, 0. 

The W{a, Go) can also be characterized as a particular ’’stick-breaking” stochastic pro¬ 
cess (Sethuraman, 1994; Sethuraman & Tiwari, 1982). A random probability measure (G) 
that is drawn from the VV{a,Go) prior, with G ~ W{a,Go), is constructed by hrst tak¬ 
ing independently and identically distributed (i.i.d.) samples of {v, 0) from the following 
distributions: 


I a ~ Be(l, a), j = l, 2 ,..., ( 12 a) 

6j I Go ~ Go, j = 1 , 2 ,..., ( 12 b) 

and then using the samples {vj,6j)'^^ to construct the random probability measure by: 

OO 

G{B) = Y,^M{B), V5ei3(0). (12c) 

Above, the UjS are mixture weights, particularly, stick-breaking weights constructed by: 

for j = 1 , 2 ,..., ( 12 d) 

1=1 

and they sum to 1 (i.e., — !)• 

More in words, a random probability measure, G, drawn from a W{a,GQ) prior dis¬ 
tribution on Qe = {Gje, can be represented as inhnite-mixtures of degenerate probability 
measures (distributions). Such a random distribution is discrete with probability 1, which 
is obvious because the degenerate probability measure ((5e (■)) is discrete. The locations Oj 
of the point masses are a sample from Gq. The random weights ujj are obtained from a 
stick-breaking procedure, described as follows. First, imagine a stick of length 1 . As shown 
in ( 12 d), at stage j = 1 a piece is broken from this stick, and then the value of the hrst weight 
oji is set equal to the length of that piece, with oji = vi. Then at stage j = 2, a piece is 
broken from a stick of length 1 —cji, and then the value of the second weight bj 2 = 1 ^ 2(1 ~'i^i) 
is set equal to the length of that piece. This procedure is repeated for j = 1, 2, 3,4,..., where 
at any given stage j, a piece is broken from a stick of length 1 — then the value 

of the weight Uj is set equal to the length of that piece, with ujj = vj nP.(l-«i)- The 
entire procedure results in weights that sum to 1 (almost surely). 

The stick-breaking construction ( 12 ) immediately suggests generalizations of the W{a, Go), 
especially by means of increasing the flexibility of the prior ( 12 a) for the random parameters 
{vj)’^i that construct the stick-breaking mixture weights (12d). One broad generalization is 
given by a general stick-breaking process (Ishwaran & James, 2001), denoted iSS(a, b,Go) 
with positive parameters a = ( 01 , 02 ,...) and b = (^ 1 , 62 ,...), which gives a prior on Qq 
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= This process replaces the i.i.d. beta distribution assumption in (12a), with the 

more general assumption of independent beta distributions, with 

Vj\a ^ Be{aj, bj), for j = 1, 2,.... (13) 

In turn, there are many interesting special cases of the SB{a, b, Gq) process prior, including: 

1. The Pitman-Yor (Poisson-Dirichlet) process, denoted Vy{a, b, Gq), which assumes aj = 
1 — a and bj = b + ja, for j = 1, 2,... , in (13), with 0 < a < 1 and b > —a (Perman, 
et ah 1992; Pitman & Yor, 1997). 

2. The beta two-parameter process, which assumes aj = a and bj = b in (13) (Ishwaran 
& Zarepour, 2000). 

3. The normalized stable process (Kingman, 1975), which is equivalent to the Vy{a, 0, Go) 
process, with 0 < a < 1 and 6 = 0 . 

4. The Dirichlet process PP(q:,Go), which assumes aj = 1 and bj = a in (13), and with 
is equivalent to the Vy{0, a, Go) process. 

5. The geometric weights prior, denoted ^W(a, b, Gq), which assumes in (13) the equality 
restriction v = Vj for j = 1 , 2 ,..., leading to mixture weights ( 12 d) that can be re¬ 
written as ujj = V {1 — vy~^, for j = 1, 2,... (Fuentes-Garcia, et ah 2009, 2010). These 
mixture weights may be assigned a beta prior distribution, with v ~ Be (a, b). 

Another generalization of the W{a, Gq) is given by the mixture of Dirichlet process (MDP), 
defined by the stick-breaking construction ( 12 ), after sampling from prior distributions a ~ 
n (a) and i? ~ IT {'&) for the precision and baseline parameters (Antoniak, 1974). 

A BNP prior distribution on Qq = {Gje, dehned by a Normalized Random Measure 
(NRM) process, assumes that a discrete random probability measure G, given by (12c), is 
constructed by mixture weights that have the form: 

j > 7 = 1,2 ,...; ujj> 0, = T (14) 

1 ^ 1 = 1 1 ; 

The Ji, J 2 ,/ 3 ,... are the jump sizes of a non-Gaussian Levy process whose sum is almost 
surely finite (see e.g. James et ah, 2009), and are therefore stationary independent increments 
(Bertoin, 1998). The DP(a,Go) is a special NRM process which assumes that Xljli~ 
Ga(Q;, 1) (Ferguson, 1973, pp. 218-219). 

An important NRM is given by the normalized inverse-Gaussian N'XQ{c,Gq) process 
(Lijoi et ah, 2005), which can be characterized as a stick-breaking process (Favaro, et ah, 
2012 ), defined by the stick-breaking construction ( 12 ), after relaxing the i.i.d. assumption 
( 12 a), by allowing for dependence among the Vj distributions, with: 



= 7 . . = 1 , 2 ,... 

Vlj + VQj 

(15a) 

Vij 

~ GiG(c^/{n’r;(i - 1 . -i/ 2 ), 

(15b) 


~ IG(1/2,2). 

(15c) 
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The random variables (15a) follow normalized generalized inverse-Gaussian distributions, 
with p.d.f. given by equation (4) in Favaro, et ah (2012). 

Stick-breaking process priors can be characterized in terms of the clustering behavior 
that it induces in the posterior predictive distribution of 6. Let {0* : c = 1,..., < 

n} be the kn < n unique values (clusters) among the n observations of a data set. Let 
T„ = {Cl,..., Cc,..., Cfe^} be the random partition of the integers {1,..., n}. Each cluster 
is defined by Cc = {i '■ 6i = 0*} C {1,..., n}, and has size Uc = |Cc|, with cluster frequency 
counts n„ = (ui,..., ric,..., n^^) and X)c=i = n. 

When G is assigned a Pitman-Yor Vy{a, b, Gq) process prior, the posterior predictive 
probability of a new observation 0„+i is dehned by: 

7 I 7 

= VBeB(e). ( 16 ) 

b + n 0 -|- n 


That is, 0„+i forms a new cluster with probability {b + akn)/ib + n), and otherwise with 
probability (uc — a)/{b + n), 0„+i is allocated to old cluster Cc, for c = 1,..., kn. Recall 
that the normalized stable process (Kingman, 1975) is equivalent to the Vy{a, 0, Gq) process 
with 0 < a < 1 and 5 = 0; and the VV{a, Gq) is the Vy{0, b, Go) process with a = 0 and 
b = a. 

Under the AfXQ{c,Go) process prior, the posterior predictive distribution is defined by 
the probability function, 

kn 

P{yn +1 G R I 01 ,..., 0„) = w^^^GoiB) + J^iuc - .5)(5,*(R), WB G B{Q), (17a) 

C=1 


with: 

E {-c^)-^^^T{kn + 1 + 2/ - 2n ; c) 

.,,G) = 1=0 \ I-/ _ 

° /n - 1\ 

2^ E f I ] (-c2)-T(A;„ + 2 + 21 - 2n] c) 

= ^=0 VW _ 

^ ^-1 Cn-1\ 

^Ef I ]{-y)~y{kn + 2 + 2l-2n; c) 


(17b) 


(17c) 


where r(-; •) is the incomplete gamma function (Lijoi et ah, 2005, p. 1283). Finally, ex¬ 
changeable partition models (e.g., Hartigan, 1990; Barry & Hartigan, 1993; Quintana & 
Iglesias, 2003) also give rise to random clustering structures of a form (17), and therefore co¬ 
incide with the family of Gibbs-type priors, which include the Vy{a, b, Gq) and AfXQ{c, Gq) 
processes and their special cases. More detailed discussions on the clustering behavior in¬ 
duced by various BNP priors are given by De Blasi, et ah (2015). 

So far, we have described only BNP priors for the mixture distribution ( 8 ) of the general 
BNP regression model (7), while assuming no covariate dependence in the mixing distribu¬ 
tion, with Gx := G. We now consider dependent BNP processes. A seminal example is 
given by the Dependent Dirichlet process ('D'DP(ax,G qx)) (MacEachern 1999; 2000; 2001), 
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which models a covariate (x) dependent process Gx, by allowing either the baseline dis¬ 
tribution Gqx, the stick-breaking mixture weights cjj(x), and/or the precision parameter 
Ox to depend on covariates x. In general terms, a random dependent probability measure 
Gx|ax,Gox ~ Gox) can be represented by Sethuraman’s (1994) stick-breaking 

construction, as: 


Gx{B) 

= V5ei3(0), 

(17d) 

UJj{x) 

= ^^i(x)ni=l(i-^^fc(x)), 

(17e) 

^^i(x) 

: A — )■ [0,1], 

(17f) 


~ Qxjy 

(17g) 

^i(x) 

~ Gox- 

(17h) 


Next, we describe an important BNP regression model, with a dependent mixture distribu¬ 
tion Gx assigned a specific WP(ax:, Gqx) prior. 

3.1 ANOVA-Linear DDP model 

Assume that the data T>n = {(Pi, can be stratified into N/^ groups, indexed by 

h = 1,..., N/i, respectively. For each of group h, let pi(h) be the ith dependent observation 
of group h, and let yh = (Pi(h))^(h)=i be the column vector of rih dependent observations, 
corresponding to an observed design matrix X/^ = (x^^^,... ... ,xT^) of rih rows of 

covariate vectors xj^^^ respectively. Possibly, each of the groups of observations has only 
one observation (i.e., = 1), in which case = n. 

The ANOVA-linear DDP model (De lorio, et al. 2004; Miiller et ah, 2005) can be defined 
as: 


~ /(y,|X,;C), h = l,...,N, (18a) 

f{yh\^h,C) = n ( 18 b) 

i=i l4D=i J 

(18c) 

Vj\a ~ Be(l,a) (18d) 

/3j |/x,T ~ N(/^,T) (18e) 

~ IG(ao/2,ao/2) (18f) 

fi,T N(^i|0,roIp+i)IW(T|p-F3, solp+i) (18g) 

a ~ Ga.{aa,ba). (18h) 


Therefore, all the model parameters are assigned prior distributions, which together, define 
the joint prior p.d.f. for ^ G D,; by: 

OO 

7r(C) = JJbe(r;j I l,a)n(/3j I/j.,T)ig(a^ I ao/2,ao/2) (19a) 

i=i 

xn{n I 0,roIp+i)iw(T \p + 3, soIp+i)ga(a | a^, b^). (19b) 
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As shown, the ANOVA-linear DDP model (18) is based on a mixing distribution G{(3) 
assigned a W{a, Gq) prior, with precision parameter a and multivariate normal baseline 
distribution, Go(') := T). Prior distributions are assigned to in order 

to allow for posterior inferences to be robust to different choices of the W{a,Go) prior 
parameters. 

The ANOVA-linear DDP model (18) is equivalent to the BNP regression model (7), 
with normal kernel densities n{yi\ and mixing distribution Gx(h) (8) assigned a 

DD'P(a,Gox) prior, where: 

GAB) = E,"i^.<^xT/3(i?), Vi? e B{Q), (20) 

with f3j I ~ N(/j,, T) and ~ IG(ao/2, ao/2) (i.e., Go(-) = N(/3 \fi, T)lG{a^ |ao/2, ao/2)), 
and with the Uj stick-breaking weights (18c) (De lorio, et ah 2004). 

A menu option in the Bayesian regression software labels the ANOVA-linear DDP 
model (18) as the ’’Dirichlet process mixture of homoscedastic linear regressions model” (for 
Step 8 of a data analysis; see next section). The software allows the user to analyze data 
using any one of many variations of the model (18). Variations of this DDP model include: 
’’mixture of linear regressions” models, as labeled by a menu option of the software, with 
mixing distribution G{f3, cr'^) for the coefficients and the error variance parameters; ’’mixture 
of random intercepts” models, with mixture distribution G(/5 q) for only the intercept pa¬ 
rameter /3 q, and with independent normal priors for the slope coefficient parameters {l3Ai\=i] 
mixture models having mixture distribution G assigned either a Pitman-Yor Vy{a, b, Gq) (in¬ 
cluding the normalized stable process prior), beta process, geometric weights, or normalized 
inverse-Gaussian process J\fXQ{c,Go) prior, each implying, respectively, a dependent BNP 
prior for a covariate-dependent mixing distribution ( 20 ) (using similar arguments made for 
the DDP model by De lorio, et ah 2004); and mixed-logit or mixed-probit regression models 
for a binary (0 or 1 ) or ordinal (c = 0 , 1 ,... , m) dependent variable. Also, suppose that the 
ANOVA-linear DDP model (18) is applied to time-lagged dependent variable data (which 
can be set up using a menu option in the software; see Section 4.1, Step 3, and Section 
4.3). Then this model is dehned by an inhnite-mixture of autoregressions, with mixture 
distribution assigned a time-dependent DDP (Di Lucca, et ah 2012). The Help menu of the 
software provides a full list of models that are available from the software. 

3.2 Infinite-Probits Mixtnre Linear Model 

As mentioned, typical BNP inhnite-mixture models assume that the mixture weights have 
the stick-breaking form (12d). However, a BNP model may have weights with a different 
form. The inhnite-probits model is a Bayesian nonparametric regression model (7)-(8), with 
prior n(<^), and with mixture distribution ( 8 ) dehned by a dependent normalized random 
measure (Karabatsos & Walker, 2012). 

For data, Vn = {(?/*, Xj)}”^^, a Bayesian inhnite-probits mixture model can be dehned 
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by: 


yi\y^i 

/( 2 /|x;C) 


^j(x) 

I 

(Tf, 

/^ok^ 

cr^ 

2 


~ /( 2 /|xi;C), i = l,...,n 

OO 

= E n(2/|/ij+xT/3,cr^)a;j(x) 

j = -QO 

= !> _ !> f 

V / V 

~ N(0,<tJ) 

~ U( 0 , 6 cr/i) 

N(0, (y^Vffd —f oo) 

N(0,cr^u), fc —l,...,p 

^ IG(ao/2, ao/2) 

~ N(0,ctJkJ) 

~ IG(o„/2,o„/2), 


( 21 a) 

( 21 b) 

( 21 c) 

( 21 d) 

(21e) 

( 21 f) 

( 21 g) 

( 21 h) 

( 21 i) 

( 21 j) 


with $ (■) the normal N(0,1) c.d.f., and model parameters — ((t(j)^i,fr^,/3, fr^,/3^,frjj) 
assigned a prior n(^) with p.d.f.: 


cx> 

7 r(C) = JJ n(/i^-I 0 ,aJ)u((T^ I 0,6<^^)n(/3 I 0 ,kdiag(u^ooo,yJp)) ( 22 a) 

j = -(X 

xig(k I ao/ 2 , ao/2)n(/3^ | 0, ku^Ip+i)ig(k I aa;/ 2 , a^/ 2 ), ( 22 b) 


where Jp denotes a p x 1 vector of Is. 

The Bayesian Regression software labels the BNP model (21) as the ’’Infinite ho- 
moscedastic probits regression model,” in a menu option (in Step 8 of a data analysis; see 
next section). This model is defined by a highly-flexible robust linear model, an inhnite 
mixture of linear regressions ( 21 b), with random intercept parameters /i^ modeled by in¬ 
hnite covariate-dependent mixture weights (21c). The model (21) has been extended and 
applied to prediction analysis (Karabatsos & Walker, 2012a), meta-analysis (Karabatsos et 
ah 2015), (test) item-response analysis (Karabatsos, 2015), and causal analysis (Karabatsos 
& Walker, 2015). 

The covariate-dependent mixture weights Wj/x) in (21c), dehning the mixture distribution 
( 8 ), are modeled by a probits regression for ordered categories j = ..., — 2 , — 1 , 0 , 1 , 2 ,..., 
with latent location parameter 13^, and with latent standard deviation that controls the 
level of modality of the conditional p.d.f. f{y \ x; ^) of the dependent variable Y. Specihcally, 
as 0 , the conditional p.d.f. f{y \ x; ^) becomes more unimodal. As gets larger, 

f{y I x; <^) becomes more multimodal (see Karabatsos & Walker, 2012a). 

The Bayesian Regression software allows the user to analyze data using any one of sev¬ 
eral versions of the inhnite-probits regression model (21). Versions include models where the 
kernel densities are instead specihed by covariate independent normal densities n{y \ ctj), 
and the mixture weights are modeled by: 

^,(x)^t( for2=0,±l.±2,...; (23) 

’ Wexp(xtA„)/ Vexp(xTA„) J 


18 








include models where either the individual regression coefficients f3 in the kernels, or the 
individual regression coefficients (/3^, A^) in the mixture weights (23) are assigned spike-and- 
slab priors using the SSVS method (George & McCulloch, 1993, 1997), to enable automatic 
variable (covariate) selection inferences from the posterior distribution; and include mixed- 
probit regression models for binary (0 or 1) or ordinal (c = 0,1 ,..., m) dependent variables, 
each with inverse-link function c.d.f. modeled by a covariate-dependent, infinite-mixture of 
normal densities (given by (21b), but instead for the continuous underlying latent dependent 
variable; see Karabatsos & Walker, 2015). 

3.3 Some Linear Models 

We briefly review two basic Bayesian normal linear models from standard textbooks (e.g., 
O’Hagan & Forster, 2004; Denison et ah 2003). 

First, the Bayesian normal linear model, assigned a (conjugate) normal inverse-gamma 
prior distribution to the coefficients and error variance parameters, (/3, a^), is defined by: 


Vi 

|Xj 

~ /(l/|xi), i = l,...,n 

(24a) 

fiy\ 

|x) 

= n(|/ 1 xT/3, a^) 

(24b) 

/^o 

cr^ 

~ N(0, OO) 

(24c) 

l^k 


~ N(0,a^ny3), k = l,...,p 

(24d) 



~ IG(ao/2, ao/2). 

(24e) 


An extension of the model (24) is provided by the Bayesian 2-level normal random-effects 
model (HLM). Again, let the data Vn = {(^j, Xj)}jh]^ be stratified into Nh groups, indexed 
by h = 1 ,..., Nfi- Also, for each group h, let yi(^h) be the Ah dependent observation, and let 
Yh = (|/i(/i))”()j)=i be the column vector of rih dependent observations, corresponding to an 
observed design matrix X/^ = (x|j.^^,..., ..., x))^) of Uh rows of covariate vectors xj^^^ 

respectively. Then a Bayesian 2-level model (HLM) can be represented by: 


yi{h) 1 

~ f{y\^i{h)), i{h) = l,...,nh 

(25a) 

f{y\^i{h)) 


(25b) 


= X^y^ -F X^W/j 

(25c) 


~ N(0, oo) 

(25d) 

l^k k 

~ N(0,aky3), fc = l,...,p 

(25e) 

Uh 1 T 

~ N(0,T), h = l,...,Nh 

(25f) 


~ IG(ao/2, ao/2) 

(25g) 

T 

~ IW(p-h3,soIp+i)- 

(25h) 


This model (25), as shown in (25f), assumes that the random coefficients Uh (for h = 
1 ,..., Nh) are normally distributed over the Nh groups. 

Both linear models above, and the different versions of these models mentioned in Section 
1, are provided by the Bayesian Regression software. See the Help menu for more details. 
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4 Using the Bayesian Regression Software 

4.1 Installing the Software 

The Bayesian Regression software is a stand-alone package for a 64-bit Windows 
computer'^. To install the software on your computer, take the following steps: 

1. Go the Bayesian Regression software web page: 

http://www.uic.edu/~georgek/HomePage/BayesSoftware.html. 

Then click the link on that page to download the Bayesian Regression software 
installation hie, named Bayeslnstaller_web64bit.exe (or Bayeslnstaller_web32bit.exe). 

2. Install the software by clicking the hie BayesInstaller_webXXbit.exe. This will include 
a web-based installation of MATLAB Compiler Runtime, if necessary. As you install, 
select the option ’’Add a shortcut to the desktop,” for convenience. (To install, be 
connected to the internet, and temporarily disable any hrewall or proxy settings on 
your computer). 

Then start the software by clicking the icon BayesRegXXbit.exe. 

The next subsection provides step-by-step instructions on how to use the Bayesian re¬ 
gression software to perform a Bayesian analysis of your data set. The software provides 
several example data hies, described under the Help menu. You can create them by click¬ 
ing the File menu option: ’’Create Bayes Data Examples hie folder.” Click the File menu 
option again to import and open an example data set from this folder. The next subsection 
illustrates the software through the analysis of the example data set PIRLSlOO.csv. 

The Bayesian Regression software, using your menu-selected Bayesian model, out¬ 
puts the data analysis results into space- and comma-delimited text hies with time-stamped 
names, which can be viewed in free NotePad++. The comma-delimited output hies in¬ 
clude the posterior samples (.MCI), model ht residual (*.RES), and the model specihcation 
(*.MODEL) hies. The software also outputs the results into graph (hgure *.hg) hies, which 
can then be saved into a EPS (*.eps), bitmap (*.bmp), enhanced metahle (*.emf), JPEG im¬ 
age (*.jpg), or portable document (*.pdf) hie format. Optionally you may graph or analyze 
a delimited text output hie after importing it into spreadsheet software (e.g., OpenOffice) or 
into the R software using the command line: ImportedData = read. csv(f ile. chooseO). 

4.2 Running the Software (12 Steps for Data Analysis) 

You can run the software for data analysis using any one of many Bayesian models of your 
choice. A data analysis involves running the following 12 basic steps (required or optional). 
In short, the 12 steps are as follows: 

(1) Import or open the data hie (Required); 

(2) Compute basic descriptive statistics and plots of your data (Optional); 

(3) Modify the data set (e.g., create variables) to set up your data analysis model (Optional); 

(4) Specify a new Bayesian model for data analysis (Required); 

(5) Specify observation weights (Optional); 

^An older version of the software can run on a 32-bit computer. 
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(6) Specify the censored observations (Optional); 

(7) Set up the MCMC sampling algorithm model posterior estimation (Required); 

(8) Click the Run Posterior Analysis button (Required); 

(9) Click the Posterior Summaries button to output data analysis results (Required); 

(10) Check MCMC convergence (Required); 

(11) Click the Posterior Predictive button to run model predictive analyses (Optional); 

(12) Click the Clear button to hnish your data analysis project. 

Then you may run a different data analysis. Otherwise, you may then Exit the software and 
return to the same data analysis project later, after re-opening the software. 

Below, we give more details on the 12 steps of data analysis. 

1. (Required) Use the File menu to Import or open the data file for analysis. 
Specihcally, the data hie that you import must be a comma-delimited hie, with name 
having the .csv extension. (Or, you may click the File menu option to open an existing 
(comma-delimited) data (*.DAT) hie). In the data hie, the variable names are located 
in the hrst row, with numeric data (i.e., non-text data) in all the other rows. For 
each row, the number of variable names must equal the number of commas minus 
1. The software allows for missing data values, each coded as NaN or as an empty 
blank. After you select the data hie to import, the software converts it into a comma- 
delimited data (*.DAT) hie. Figure 1 shows the interface of the Bayesian regression 
software. It presents the PIRLS100.DAT data set at the bottom of the interface, after 
the PIRLSlOO.csv hie has been imported. 

2. (Optional) Use the Describe/Plot Data Set menu option(s) to compute basic 
descriptive statistics and plots of the data variables. Statistics and plots include the 
sample mean, standard deviation, quantiles, frequency tables, cross-tabulations, corre¬ 
lations, covariances, univariate or bivariate histograms^, stem-and-leaf plots, univariate 
or bivariate kernel density estimates®, quantile-quantile (Q-Q) plots, two- or three- 
dimensional scatter plots, scatter plot matrices, (meta-analysis) funnel plots (Egger et 
al. 1997), box plots, and plots of kernel regression estimates with automatic bandwidth 
selection^. 

3. (Optional) Use the Modify Data Set menu option(s) to set up a data analysis. 
The menu options allow you to construct new variables, handle missing data, and/or to 

®For the univariate histogram, the bin size (h) is defined by the Freedman-Diaconis (1981) rule, with 
h = 2(IQR)n“^/^, where IQR is the interquartile range of the data, and n is the sample size. For the 
bivariate histogram, the automatic bin sizes are given by hk = where fc = 1, 2, is the sample 

standard deviation for the two variables (Scott, 1992). 

®The univariate kernel density estimate assumes normal kernels, with automatic bandwidth (h) given by 
the normal reference rule, defined hy h = where a is the data standard deviation, and n is the 

sample size (Silverman, 1986, p. 45). The bivariate kernel density estimate assumes normal kernels, with 
automatic bandwidth determined by the equations in Botev et al. (2010). 

^Kernel regression uses normal kernels, with an automatic choice of bandwidth (/i) given hy h = hxhy, 

where hx = med( | x — med(x) | )/c, and hy = med( | y — med(y) | )/c, where (x, y) give the vectors of X 
data and Y data (resp.), med(-) is the median, c = .6745 * (4/3/n)°-^, and n is the sample size (Bowman & 
Azzalini, 1997, p. 31) 
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perform other modifications of the data set. Then the new and/or modihed variables 
can be included in the Bayesian model that you select in Step 4. Figure 1 presents the 
PIRLS100.DAT data at the bottom of the software interface, and shows the data of the 
variables MALE, AGE, CLSIZE, ELL, TEXP4, EDLEVEL, ENROL, and SCHSAFE in 
the last 8 data columns, respectively, after taking z-score transformations and adding 
”Z:” to each variable name. Such transformations are done with the menu option: 
Modify Data Set > Simple variable transformations > Z score. Section 4.3 provides 
more details about the available Modify Data Set menu options. 

- FIGURE 1 in http://www.uic.edu/~georgek/HomePage/Figures .pdf - 

4. (Required) Click the Specify New Model button to select a Bayesian model 

for data analysis, and then for the model select: the dependent variable; the covari- 
ate(s) (predictor(s)) (if selected a regression model); the level-2 (and possibly level-3) 
grouping variables (if a multi-level model); and the model’s prior distribution param¬ 
eters. Figure 2 shows the software interface, after selecting the Inhnite homoscedastic 
probits regression model, along with the dependent variable, covariates, and prior pa¬ 
rameters. 

5. (Optional) To weight each data observation (row) differently under your selected 
model, click the Observation Weights bntton to select a variable containing the 
weights (must be hnite, positive, and non-missing). (This button is not available for 
a binary or ordinal regression model). By default, the observation weights are 1. For 
example, observation weights are used for meta-analysis of data where each dependent 
variable observation yi represents a study-reported effect size (e.g., a standardized 
mean difference in scores between a treatment group and a control group, or a corre¬ 
lation coefficient estimate). Each reported effect size yi has sampling variance and 
observation weight l/d\ that is proportional to the sample size for y^. Details about 
the various effect size measures, and their sampling variance formulas, are found in 
meta-analysis textbooks (e.g., Gooper, et al. 2009). Section 4.3 mentions a Modify 
Data Set menu option that computes various effect size measures and corresponding 
variances. 

- FIGURE 2 in http://www.uic. edu/~georgek/HomePage/Figures .pdf- 

6 . (Optional) Click the Censor Indicators of Y button, if the dependent variable 
consists of censored observations (not available for a binary or ordinal regression 
model). Censored observations often appear in survival data, where the dependent 
variable Y represents the (e.g., log) survival time of a patient. Formally, an observa¬ 
tion, yi, is censored when it is only known to take on a value from a known interval 
[YLBi,YuBi]', is interval-censored when —oo < Ylbi < YjjBi < oo; is right censored 
when —cx) < Y^Bi < YuBi = cxo; and left censored when —ex = Ylbi < YuBi < oo 
(e.g., Klein & Moeschberger, 2010). After clicking the Censor Indicators of Y bnt¬ 
ton, select the two variables that describe the (hxed) censoring lower-bounds {LB) 
and upper-bounds {UB) of the dependent variable observations. Name these variables 
LB and UB. Then for each interval-censored observation ?/,, its LBi and UBi values 
must be hnite, with LBi < UBi, yi < UBi, and y^ > LBi. For each right-censored 
observation y^, its LBi value must be hnite, with yi > LBi, and set UBi = —9999. 
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For each left-censored observation i/i, its UBi value must be finite, with yi < UBi, 
and set LBi = —9999. For each uncensored observation y^, set LBi = —9999 and 
UBi = -9999. 

7. (Required) Enter: the total number (S) of MC Samples, i.e., MCMC sam¬ 
pling iterations (indexed by s = 1,..., S', respectively); the number (sq > 1) of the 
initial Burn-In period samples; and the Thin number k, to retain every sampling 
iterate of the S total MCMC samples. The MCMC samples are used to estimate the 
posterior distribution (and functionals) of the parameters of your selected Bayesian 
model. Entering a thin value k > 1 represents an effort to have the MCMC samples 
be (pseudo-) independent samples from the posterior distribution. The burn-in num¬ 
ber (so) is your estimate of the number of initial MCMC samples that are biased by 
the software’s starting model parameter values used to initiate the MCMC chain at 
iteration s = 0. 

8 . (Required) Click the Run Posterior Analysis button to run the MCMC sam¬ 
pling algorithm, using the selected MC Samples, Burn-In, and Thin numbers. A wait- 
bar will then appear and display the progress of the MCMC sampling iterations. After 
the MCMC sampling algorithm hnishes running, the software will create an external: 
(a) model (*.MODEL) text hie that describes the selected model and data set; (b) 
Monte Carlo (*.MC1) samples hie which contains the generated MCMC samples; (c) 
residual (*.RES) hie that contains the model’s residual ht statistics; and (d) an opened, 
text output hie of summaries of (marginal) posterior point-estimates of model param¬ 
eters and other quantities, such as model predictive data-ht statistics. The results are 
calculated from the generated MCMC samples aside from any burn-in or thinned-out 
samples. Model ht statistics are calculated from all MCMC samples instead. The soft¬ 
ware creates all output hies in the same subdirectory that contains the data (*.DAT) 
hie. 

9. (Required) Click the Posterior Summaries button to select menu options for 
additional data analysis output, such as: text output of posterior quantile estimates of 
model parameters and 95% Monte Carlo Conhdence Intervals (see Step 10); trace plots 
of MCMC samples; 2-dimensional plots and 3-dimensional plots of (kernel) density es¬ 
timates, univariate and bivariate histograms, distribution function, quantile function, 
survival function, and hazard functions, box plots. Love plots, Q-Q plots, and Wright 
maps, of the (marginal) posterior distribution(s) of the model parameters; posterior 
correlations and covariances of model parameters; and plots and tables of the model’s 
standardized ht residuals. The software creates all text output hies in the same sub¬ 
directory that contains the data (*.DAT) hie. You may save any graphical output in 
the same directory. 

10. (Required) Click the Posterior Summaries button for menu options to check the 
MCMC convergence of parameter point-estimates, for every model parameter of 
interest for data analysis. Verify: (1) that the univariate trace plots present good mixing 
of the generated MCMC samples of each parameter; and (2) that the generated MCMC 
samples of that parameter provide sufhciently-small half-widths of the 95% Monte 
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Carlo Confidence Intervals (95% MCCIs) for parameter posterior point-estimates of 
interest (e.g., marginal posterior mean, standard deviation, qnantiles, etc.). If for 
yonr model parameters of interest, either the trace plots do not support adequate 
mixing (i.e., plots are not stable and ’’hairy”), or the 95% MCCI half-widths are not 
sufficiently small for practical purposes, then the MCMC samples of these parameters 
have not converged to samples from the model’s posterior distribution. In this case you 
need to generate additional MCMC samples, by clicking the Run Posterior Analysis 
button again. Then re-check for MCMC convergence by evaluating the updated trace 
plots and the 95% MCCI half-widths. This process may be repeated until MCMC 
convergence is reached. 

11. (Optional) Click the Posterior Predictive button® to generate model’s pre¬ 
dictions of the dependent variable P, conditionally on selected values of one or more 
(focal) covariates (predictors). See Section 2 for more details. Then select the posterior 
predictive functionals of Y of interest. Choices of functionals include the mean, vari¬ 
ance, quantiles (to provide a quantile regression analysis), probability density function 
(p.d.f.), cumulative distribution function (c.d.f.), survival function, hazard function, 
the cumulative hazard function, and the probability that Y > 0. Then select one or 
more focal covariate(s) (to define x^), and then enter their values, in order to study 
how predictions of Y varies as a function of these covariate values. For example, if you 
chose the variable ZiCLSIZE as a focal covariate, then you may enter values like —1.1, 
.02, 3.1, so that you can make predictions of Y conditionally on these covariate values. 
Or you may base predictions on an equally-spaced grid of covariate (Z:CLSIZE) val¬ 
ues, like —3, —2.5, —2,... 2, 2.5, 3, by entering —3 : .5 : 3. If your data set observations 
are weighted (optional Step 4^5), then specify a weight value for the Y predictions. 
Next, if your selected focal covariates do not constitute all model covariates, then se¬ 
lect among options to handle the remaining (non-focal) covariates. Options include 
the grand-mean centering method, the zero-centering method, the partial dependence 
method, and the clustered partial dependence method. After you made all the selections, 
the software will provide estimates of your selected posterior predictive functionals of 
Y, conditionally on your selected covariate values, in graphical and text output hies, 
including comma-delimited hies. (The software generates graphs only if you selected 
1 or 2 focal covariates; and generates no output if you specify more than 300 distinct 
values of the focal covariate(s)). All analysis output is generated in the same subdi¬ 
rectory that contains the data (*.DAT) hie. You may save any graphical output in the 
same directory. 

12. (Required) After completing the Bayesian data analysis, you may click the Clear 
button. Then you may start a diherent data analysis with another Bayesian model, or 
exit the software. Later, you may return to and continue from a previous Bayesian data 
analysis (involving the same model and data set) by using menu options to generate 
new MCMC samples and/or new data analysis output. To return to the previous 

®This button is not available for Bayesian models for density estimation. However, for these models, the 
software provides menu options to output estimates of the posterior predictive distribution, after clicking 
the Posterior Summaries button. They include density and c.d.f. estimates. See Step 9. 
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analysis, go to the File menu to open the (relevant) data hie (if necessary), and then 
click the Open Model button to open the old model (*.MODEL) hie. Then, the 
software will load this model hie along with the associated MCMC samples (*.MC1) 
hie and residual (*.RES) hies. (Returning to a previous Bayesian regression analysis is 
convenient if you have already stored the data, model, MC samples, and residual hies 
all in the same hie subdirectory). Then after clicking the Run Posterior Analysis 
button, the newly generated MCMC samples will append the existing MCMC samples 
(*.MC1) hie and update the residual (*.RES) hie. 

Finally, the software provides a Fast Ridge Regression menu option that performs a 
Bayesian data analysis using the ridge (linear) regression model (Hoerl & Kennard, 1970), 
with parameters estimated by a fast marginal maximum likelihood algorithm (Karabatsos, 
2014). This menu option can provide a fast analysis of an ultra-large data set, involving 
either a very large sample size and/or number of covariates (e.g., several thousands). At this 
point we will not elaborate on this method because it is currently the subject of ongoing 
research. 

4.3 Modify Data Set Menu Options 

Some Modify Data Set menu options allow you to construct new variables from your 
data. These new variables may be included as either covariates and/or the dependent variable 
for your Bayesian data analysis model. Methods for constructing new variables include: 
simple transformations of variables (e.g., z-score, log, sum of variables); transforming a 
variable into an ehect size dependent variable for meta-analysis (Borenstein, 2009; Fleiss & 
Berlin, 2009; Rosenthal, 1994); the creation of lagged dependent variables as covariates for 
a Bayesian autoregression time-series analysis (e.g., Prado & West, 2010); dummy/binary 
coding of variables; the construction of new covariates from other variables (covariates), via 
transformations including: polynomials, two-way interactions between variables, univariate 
or multivariate thin-plate splines (Green & Silverman, 1993) or cubic splines (e.g., Denison, 
et ah 2002); spatial-weight covariates (Stroud, et ah 2001; or thin-plate splines; Nychka, 
2000 ) from spatial data (e.g., from latitude and longitude variables) for spatial data analysis. 

Now we briefly discuss Modify Data Set menu options that can help set up a causal 
analysis of data from a non-randomized (or randomized) study. First, a propensity score 
variable, included as a covariate in a regression model; or as a dummy-coded covariate that 
stratihes each subject into one of 10 (or more) ordered groups of propensity scores; or as ob¬ 
servations weights (entered as the inverse of the propensity scores); can help reduce selection 
bias in the estimation of the causal effect (slope coefficient) of a treatment-receipt (versus 
non-treatment/control) indicator covariate on a dependent variable of interest (Rosenbaum 
& Rubin, 1983, 1984; Imbens, 2004; Lunceford & Davidian, 2004; Schafer & Kang, 2008; 
Hansen, 2008). As an alternative to using propensity scores, we may consider a regression 
discontinuity design analysis (Thistlewaite & Campbell, 1960; Cook, 2008). This would in¬ 
volve specifying a regression model, with dependent (outcome) variable of interest regressed 
on covariates that include an assignment variable, and a treatment assignment variable that 
indicates (0 or 1) whether or not the assignment variable exceeds a meaningful threshold. 
Then under mild conditions (Hahn, et al. 2001; Lee & Lemieux, 2010), the slope coeffi¬ 
cient estimate for the treatment variable is a causal effect estimate of the treatment (versus 
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non-treatment) on the dependent variable. For either the propensity score or regression dis¬ 
continuity design approach, which can be set up using appropriate Modify Data Set menu 
options, causal effects can be expressed in terms of treatment versus non-treatment compar¬ 
isons of general posterior predictive functionals of the dependent variable (e.g., Karabatsos 
& Walker, 2015). 

In some settings, it is of interest to include a multivariate dependent variable (multiple 
dependent variables) in the Bayesian regression model (Step 4). You can convert a multi¬ 
variate regression problem into a univariate regression problem (Gelman, et al., 2004, Ch. 
19) by clicking a menu option that ’’vectorizes” or collapses the multiple dependent variables 
into a single dependent variable. This also generates new covariates in the data set, having 
a block design that is suited for the (vectorized) multiple dependent variables. To given 
an example involving item response theory (IRT) modeling (e.g., van der Linden, 2015), 
each examinee (data set row) provides data on responses to J items (data columns) on a test 
(e.g., examination or questionnaire). Here, a menu option can be used to vectorize (collapse) 
the J item response columns into a single new dependent variable (data column), and to 
create corresponding dummy item indicator (—1 or 0) covariates. In the newly-revised data, 
each examinee occupies J rows of data. Then a Bayesian regression model of the software, 
which includes the new dependent variable, item covariates, and the examinee identiher as 
a grouping variable (or as dummy (0,1) indicator covariates), provides an Item Response 
Theory (IRT) model for the data (van der Linden, 2015). 

Similarly, it may be of interest to include a categorical (polychotomous) dependent vari¬ 
able in your Bayesian regression model. Assume that the variable takes on m -|- 1 (possibly 
unordered) categorical values, indexed by c = 0,1,..., m, with c = 0 the reference category. 
In this case, you may run a menu option that recodes the categorical dependent variables 
into m -|- 1 binary (e.g., 0 or 1) variables, then vectorizes (collapses) these multiple binary 
variables (data columns) into a single binary dependent variable (column), and reformats 
the covariate data into a block design suitable for the multiple binary variables. Then for 
data analysis, you may specify a binary regression model that includes the (collapsed) binary 
dependent variable and the reformatted covariates (Begg & Gray, 1984; also see Wu et al. 
2004). 

The Modify Data Set menu also provides menu options to reduce the number of vari¬ 
ables for dimension reduction, including: principal components, multidimensional scaling, or 
scaling by the true score test theory and Gronbach’s alpha reliability analysis of test items, 
and propensity scoring. There are also menu options that handle missing data, including: 
nearest-neighbor hot-deck imputation of missing data (Andridge & Little, 2010); the pro¬ 
cessing of multiple (missing data) imputations or plausible values obtained externally; and 
the assignment of missing (NaN) values based on specihc missing data codes (e.g., 8, 9, or 
-99, etc.) of variables. Finally, there are Modify Data Set menu options for more basic data 
editing and reformatting, including: adding data on a row-identihcation (ID) variable; ag¬ 
gregating (e.g., averaging) variables by a group identihcation variable; the selection of cases 
(data rows) for inclusion in a reduced data set, including random selection or selection by 
values of a variable; the sorting of cases; the deletion of variables; changing the variable 
name; and moving the variables into other columns of the data hie. 
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5 Real Data Example 

We illustrate the software through the analysis of data set hie PIRLSlOO.csv, using the 
software steps outlined in the previous section. The data are based on a random sample of 
100 students from 21 low-income U.S. elementary schools, who took part of the 2006 Progress 
in International Reading Literacy Study (PIRLS). 

- FIGURES in http://www.uic. edu/~georgek/HomePage/Figures .pdf - 

The dependent variable is student literacy score (zREAD). The eight covariates are: 
student male status (Z:MALE) indicator (0 or 1) and age (AGE); student’s class size (SIZE) 
and class percent English language learners (ELL); student’s teacher years of experience 
(TEXP4) and education level (TEXP4 = 5 if bachelor’s degree; TEXP4 = 6 if at least 
master’s degree); student’s school enrollment (ENROL) and safety rating (SAFE = 1 is 
high; SAFE = 3 is low). Figure 3 presents the univariate descriptive statistics of these 
variables, from a text output hie obtained by the menu option: Describe/Plot Data Set > 
Summaries and Frequencies > Univariate descriptives. Data is also available on a school 
identiher variable (SGHID). 

The data for each of the 8 covariates were transformed into z-scores having mean 0 and 
variance 1, using the menu option: Modify Data Set > Simple variable transformations > 
Z-score. These transformations will make the slope coefficients of the 8 covariates (resp.) 
interpretable on a common scale, in a regression analysis. 

The BNP inhnite-probits mixture model (21) was ht to the PIRLSIOO data, using the 
dependent variable and the 8 z-scored covariates. For the model’s prior density function (22), 
the following software defaults were specihed for the prior parameters: = 5, v = 100, 

Oq = 0.01, = 10, and = 0.01. This represents an attempt to specify a rather non- 

informative prior distribution for the model parameters. In order to estimate the model’s 
posterior distribution, 50,000 MGMG sampling iterations were run, using an initial burn-in 
of 2,000 and thinning interval of 5. 

Figure 4 presents the model’s predictive data-ht statistics (including their 95% MGGI 
half-widths), from text output that was opened by the software immediately after the 50K 
MGMG sampling iterations were run. The model obtained a D{m) statistic of 32, with an 
R-squared of .96, and had no outliers according to standardized residuals that ranged within 
—2 and 2. 

- FIGURE 4 in http://www.uic . edu/~georgek/HomePage/Figures .pdf - 

- FIGURE 5 in http://www.uic . edu/~georgek/HomePage/Figures .pdf - 

Figure 5 presents the (marginal) posterior point-estimates of all the model parameters, 
calculated from the 50K MGMG samples (aside from the burn-in and thinned-out MGMG 
samples). Estimates include the marginal posterior mean, median, standard deviation, 50% 
credible interval (given by the 25% (.25 quantile) and 75% (.75 quantile) output columns), 
and the 95% credible interval (2.5%, 97.5% columns). Figure 5 is part of the same text 
output that reported the model ht statistics (Figure 4). 

Figure 6 is a box plot of the (marginal) posterior quantile point-estimates of the intercept 
and slope coefficient parameters for the 8 covariates (i.e., parameters (3 = (/9o,/9i, • • • ,/5p)^)- 


27 


A red box (blue box, resp.) flags a coefficient parameter that is (not, resp.) significantly 
different than zero, according to whether or not the 50% (marginal) posterior interval (box) 
includes zero. The box plot menu option is available after clicking the Posterior Summaries 
button. 


- FIGURE 6 in http://www.uic. edu/~georgek/HomePage/Figures .pdf - 

MCMC convergence, for the parameters of interest for data analysis, can be evaluated 
by clicking certain menu options after clicking the Posterior Summaries button. Figure 7 
presents the trace plots for the model’s intercept and slope parameters sampled over the 
50K MCMC sampling iterations. The trace plots appear to support good mixing for these 
parameters, as each trace plot appears stable and ’’hairy.” Figure 8 presents the 95% MCCI 
half-widths of the (marginal) posterior coefficient point-estimates of Figure 5. The half¬ 
widths are nearly all less than .10, and thus these posterior point-estimates are reasonably 
accurate in terms of Monte Carlo standard error. If necessary, the software can re-click 
the Posterior Analysis button, to run additional MCMC sampling iterations, to obtain more 
precise posterior point-estimates of model parameters (as would be indicated by smaller 95% 
MCCI half-widths). 

- FIGURE 7 in http://www.uic. edu/~georgek/HomePage/Figures .pdf - 

- FIGURES in http://www.uic. edu/~georgek/HomePage/Figures .pdf - 

- FIGURE 9 in http://www.uic. edu/~georgek/HomePage/Figures .pdf - 

- FIGURE 10 in http://www.uic. edu/~georgek/HomePage/Figures .pdf - 

The user can click the Posterior Predictive button to investigate how chosen features 
(functionals) of the posterior predictive distribution varies as a function of one or more 
selected (focal) covariates, through graphical and text output. 

Figure 9 provides a quantile and mean regression analysis, by showing the estimates 
of the mean, and the .1, .25, .5 (median), .75, and .9 quantiles of the model’s posterior 
predictive distribution of zREAD, conditionally on selected values —2, —1.5,..., 1,1.5, 2 of 
the ZiCLSIZE covariate, and on zero for all the 7 other covariates (using the zero-centering 
method). It presents nonlinear relationships between literacy performance (zREAD) and 
class size (Z:CLSIZE). In particular, for higher literacy-performing students (i.e., .75 and 
.90 quantiles of zREAD), there is a highly nonlinear relationship between zREAD and class 
size. For lower performing students (i.e., .1 and .25 quantiles of zREAD), there is a flatter 
relationship between zREAD and class size. For ’’middle” performing students (mean and 
.5 quantiles of zREAD), there is a rather flat relationship between zREAD and class size. 

Figure 10 shows a 3-dimensional plot of (Rao-Blackwellized) estimates of the model’s 
posterior predictive p.d.f. of zREAD, conditionally on the same values of the 8 covariates 
mentioned earlier. As shown, both the location and shape of the zREAD distribution changes 
as a function of class size. For each posterior predictive p.d.f. estimate (panel), a homoge¬ 
neous cluster of students (in terms of a shared zREAD score) is indicated by a bump (or 
mode) in the p.d.f. 

- FIGURE 11 in http://www.uic. edu/~georgek/HomePage/Figures .pdf - 


The ANOVA-linear DDP model (18) was fit to the PIRLSIOO data. For this model’s prior 
(19), the following parameters were chosen: vq = 10, Sq = 10, Oq = 2, = 1, and = 1. 

This was an attempt to specify a rather non-informative prior distribution for the model pa¬ 
rameters. This model was estimated using 50,000 MCMC sampling iterations, with burn-in 
of 2,000 and thinning interval of 5. MCMC convergence was confirmed according to univari¬ 
ate trace plots and small 95% MCCI half-widths for (marginal) posterior point-estimates 
(nearly all less than .1). Figure 11 presents the marginal posterior densities (distributions) 
of the intercept and slope parameters, from the mixture distribution of the model. Each of 
these distributions (densities) are skewed and/or multimodal, unlike normal distributions. 
Each mode (bump) in a distribution (density) refers to a homogeneous cluster of schools, in 
terms of a common value of the given (random) coefficient parameter. 

Finally, recall that for the PIRLSIOO data, the infinite-probits mixture model (21) ob¬ 
tained a D{m) statistic of 32, an R-squared of .96, and no outliers according to all the ab¬ 
solute standardized residuals being less than 2 (Figure 4). The ANOVA-linear DDP model 
(18) obtained a D{m) statistic of 113, an R-squared of .64, and no outliers. The ordinary 
linear model obtained a D{rn) statistic of 139, an R-squared of .24 and 5 outliers. This linear 
model assumed rather non-informative priors, with ~ N(0,r’o —)■ cxo), ~ N(0,1000), 
fc = 1,..., 8, and ~ IG(.001, .001). 

6 Conclusions 

We described and illustrated a stand-alone, user-friendly and menu-driven software pack¬ 
age for Bayesian data analysis. Such analysis can be performed using any one of many 
Bayesian models that are available from the package, including BNP inhnite-mixture regres¬ 
sion models and normal random-effects linear models. As mentioned in the data analysis 
exercises of Appendix B, the software can be used to address a wide range of statistical ap¬ 
plications that arise from many scientihc helds. In the future, new Bayesian mixture models 
will be added to the software. They will include mixture models dehned by other kernel 
density functions, and models with parameters assigned a mixture of Polya Trees prior. 

7 Acknowledgments 

This paper is supported by NSF-MMS research grant SES-1156372. 


29 


References 


Agresti A (2002). Categorical Data Analysis. John Wiley and Sons. 

Albert J, Chib S (1993). “Bayesian Analysis of Binary and Polychotomous Response Data.” 
Journal of the American Statistical Association, 88, 669-679. 

Andridge RR, Little R (2010). “A Review of Hot Deck Impntation for Snrvey Non¬ 
response.” International Statistical Review, 18, 40-64. 

Antoniak C (1974). “Mixtnres of Dirichlet Processes with Applications to Bayesian Non- 
parametric Problems.” Annals of Statistics, 2, 1152-1174. 

Atchade Y, Rosenthal J (2005). “On Adaptive Markov chain Monte Carlo Algorithms.” 
Bernoulli, 11, 815-828. 

Barry D, Hartigan J (1993). “A Bayesian-Analysis for Change Point Problems.” Journal 
of the American Statistical Association, 88, 309-319. 

Begg C, Gray R (1996). “Calculation of Polychotomous Logistic Regression Parameters 
Using Individualized Regressions.” Biometrika, 71, 11-18. 

Bernardo J, Smith A (1994). Bayesian Theory. Wiley, Chichester, England. 

Bertoin J (1998). Levy Processes. Cambridge University Press, Cambridge. 

Borenstein M (2009). “Effect Sizes for Continuous Data.” In H Cooper, L Hedges, J Valen¬ 
tine (eds.). The Handbook of Research Synthesis and Meta-Analysis (2nd Ed.), pp. 
221-235. Russell Sage Foundation, New York. 

Botev Z, Grotowski J, Kroese D (2010). “Kernel Density Estimation via Diffusion.” The 
Annals of Statistics, 38, 2916-2957. 

Bowman A, Azzalini A (1997). Applied Smoothing Technigues for Data Analysis: The 
Kernel Approach with S-Plus Illustrations. Oxford University Press, Oxford. 

Brooks S (1998). “Quantitative Convergence Assessment for Markov chain Monte Carlo 
via Cusums.” Statistics and Computing, 8, 267-274. 

Brooks S, Gelman A, Jones G, Meng X (2011). Handbook of Markov Chain Monte Carlo. 
Chapman and Hall/CRC, Boca Raton, EL. 

Burr D (2012). “bspmma: An R package for Bayesian Semiparametric Models for Meta- 
Analysis.” Journal of Statistical Software, 50, 1-23. 

Burr D, Doss H (2005). “A Bayesian Semiparametric Model for Random-Effects Meta- 
Analysis.” Journal of the American Statistical Association, 100, 242-251. 

Casella G, Berger R (2002). Statistical Inference (2nd Ed.). Duxbury, Pacihc Grove, CA. 


30 


Cepeda E, Gamerman D (2001). “Bayesian modeling of variance heterogeneity in normal 
regression models.” Brazilian Journal of Probability and Statistics, If, 207-221. 

Cook T (2008). “Waiting for Life to Arrive: A History of the Regression-Discontinuity 
Design in Psychology, Statistics and Economics.” Journal of Econometrics, If2, 636- 
654. 

Cooper H, Hedges L, Valentine J (2009). The Handbook of Research Synthesis and Meta- 
Analysis (Second Edition). Russell Sage Foundation, New York. 

De lorio M, Miiller P, Rosner G, MacEachern S (2004). “An ANOVA model for dependent 
random measures.” Journal of the American Statistical Association, 99, 205-215. 

DeBlasi P, Favaro S, Lijoi A, Mena R, Priinster I, Ruggiero M (2015). “Are Gibbs-Type 
Priors the Most Natural Generalization of the Dirichlet Process?” IEEE Transactions 
on Pattern Analysis and Machine Intelligence, 37, 212-229. 

Denison D, Holmes C, Mallick B, Smith A (2002). Bayesian Methods for Nonlinear Clas¬ 
sification and Regression. John Wiley and Sons, New York. 

Di Lucca M, Guglielmi A, Muller P, Quintana F (2012). “A Simple Class of Bayesian 
Non-parametric Autoregression Models.” Bayesian Analysis, 7, 771-796. 

Egger M, Smith GD, Schneider M, Minder C (1997). “Bias in Meta-Analysis Detected by 
a Simple, Graphical Test.” British Medical Journal, 315, 629-634. 

Escobar M, West M (1995). “Bayesian Density Estimation and Inference Using Mixtures.” 
Journal of the American Statistical Association, 90, 577-588. 

Evans I (1965). “Bayesian Estimation of Parameters of a Multivariate Normal Distribu¬ 
tion.” Journal of the Royal Statistical Society, Series B, 27, 279-283. 

Favaro S, Lijoi A, Priinster I (2012). “On the Stick-Breaking Representation of Normalized 
Inverse Gaussian Priors.” Biometrika, 99, 663-674. 

Ferguson T (1973). “A Bayesian Analysis of Some Nonparametric Problems.” Annals of 
Statistics, 1, 209-230. 

Flegal J, Jones G (2011). “Implementing Markov Chain Monte Carlo: Estimating with 
Conhdence.” In S Brooks, A Gelman, G Jones, X Meng (eds.). Handbook of Markov 
Chain Monte Carlo, pp. 175-197. CRC, Boca Raton, EL. 

Fleiss J, Berlin J (2009). “Effect Size for Dichotomous Data.” In H Cooper, L Hedges, J 
Valentine (eds.). The Handbook of Research Synthesis and Meta-Analysis (2nd Ed.), 
pp. 237-253. Russell Sage Foundation, New York. 

Freedman D, Diaconis P (1981). “On the Histogram as a Density Estimator: L2 theory.” 
Probability Theory and Related Pields, 57, 453-476. 


31 


Friedman J (2001). “Greedy Function Approximation: A Gradient Boosting Machine.” 
Annals of Statistics, 29, 1189-1232. 

Fuentes-Garcia R, Mena R, Walker S (2009). “A Nonparametric Dependent Process for 
Bayesian Regression.” Statistics and Probability Letters, 79, 1112-1119. 

Fuentes-Garcia R, Mena R, Walker S (2010). “A New Bayesian Nonparametric Mixture 
Model.” Communications in Statistics, 39, 669-682. 

Gelfand A, Diggle P, Guttorp P, Fuentes M (2010). Handbook of Spatial Statistics. Ghap- 
man and Hall/GRG, Boca Raton. 

Gelfand A, Ghosh J (1998). “Model Ghoice: A Minimum Posterior Predictive Loss Ap¬ 
proach.” Biometrika, 85, 1-11. 

Gelfand A, Mukhopadhyay S (1995). “On Nonparametric Bayesian Inference for the Dis¬ 
tribution of a Random Sample.” Canadian Journal of Statistics, 23, 411-420. 

Gelfand A, Smith A, Lee TM (1992). “Bayesian Analysis of Gonstrained Parameter and 
Truncated Data Problems Using Gibbs Sampling.” Journal of the American Statistical 
Association, 87, 523-532. 

Gelman A, Garlin A, Stern H, Rubin D (2004). Bayesian Data Analysis (Second Edition). 
Ghapman and Hall, Boca Raton, Florida. 

George E, McGulloch R (1993). “Variable Selection via Gibbs Sampling.” Journal of the 
American Statistical Association, 88, 881-889. 

George E, McGulloch R (1997). “Approaches for Bayesian Variable Selection.” Statistica 
Sinica, 7, 339-373. 

Geyer G (2011). “Introduction to MGMG.’Tn S Brooks, A Gelman, G Jones, X Meng 
(eds.). Handbook of Markov Chain Monte Carlo, pp. 3-48. GRG, Boca Raton, FL. 

Ghosh J, Ramamoorthi R (2003). Bayesian Nonparametrics. Springer, New York. 

Gilks W, Wang G, Yvonnet B, Goursaget P (1993). “Random-Effects Models for Longitu¬ 
dinal Data Using Gibbs Sampling.” Biometrics, 49, 441-453. 

Green P, Silverman B (1993). Nonparametric Regression and Ceneralized Linear Models: 
A Roughness Penalty Approach. GRG Press, Los Altos, GA. 

Hahn J, Todd P, der Klaauw WV (2001). “Identihcation and Estimation of Treatment 
Effects with a Regression-Discontinuity Design.” Econometrica, 69, 201-209. 

Hansen B (2008). “The Prognostic Analogue of the Propensity Score.” Biometrika, 95, 
481-488. 

Hanson T (2006). “Inference for Mixtures of Finite Polya Tree Models.” Journal of the 
American Statistical Association, 101, 1548-1565. 


32 


Hartiffan J (1990). “Partition Models.” Communications in Statistics: Theoru and Methods, 
19, 2745-2756. 

Hastie T, Tibshiriani R, Friedman J (2009). The Elements of Statistical Learning: Data 
Mining, Inference, and Prediction (2nd Ed.). Springer-Verlag, New York. 

Hedges L (1981). “Distribution Theory for Glass’s Estimator of Effect size and Related 
Estimators.” Journal of Educational and Behavioral Statistics, 6, 107-128. 

Hjort N, Holmes C, Miiller P, Walker S (2010). Bayesian Nonparametrics. Cambridge 
University Press, Cambridge. 

Hoerl A, Kennard R (1970). “Ridge Regression: Biased Estimation for Nonorthogonal 
Problems.” Technometrics, 12, 55-67. 

Holmes C, Held K (2006). “Bayesian Auxiliary Variable Models for Binary and Multinomial 
Regression.” Bayesian Analysis, 1, 145-168. 

IBM Corp (2015). IBM SPSS Statistics for Windows, v22.0. IBM Corp., Armonk, NY. 

Imbens G (2004). “Nonparametric Estimation of Average Treatment Effects Under Exo¬ 
geneity: A Review.” The Review of Economics and Statistics, 86, 4-29. 

Imbens GW, Lemieux T (2008). “Regression Discontinuity Designs: A Guide to Practice.” 
Journal of Econometrics, 142, 615-635. 

Ishwaran H, James L (2001). “Gibbs Sampling Methods for Stick-Breaking Priors.” Journal 
of the American Statistical Association, 96, 161-173. 

Ishwaran H, Zarepour M (2000). “Markov chain Monte Carlo in Approximate Dirichlet 
and Beta Two-Parameter Process Hierarchical Models.” Biometrika, 87, 371-390. 

James L, Lijoi A, Priinster I (2009). “Posterior Analysis for Normalized Random Measures 
with Independent Increments.” Scandinavian Journal of Statistics, 36, 76-97. 

Jara A, Hanson T, Quintana F, Muller P, Rosner G (2011). “DPpackage: Bayesian Semi- 
and Nonparametric Modeling in R.” Journal of Statistical Software, fO, 1-20. 

Jara A, Lesaffre E, Delorio M, Quintana F (2010). “Bayesian Semiparametric Inference for 
Multivariate Doubly-Interval-Censored Data.” Annals of Applied Statistics, 4, 2126- 
2149. 

Johnson N, Kotz S, Balakrishnan N (1994). Continuous Univariate Distributions, Vol. 1. 
Wiley, New York. 

Jordan M, Jacobs R (1994). “Hierarchical Mixtures of Experts and the EM Algorithm.” 
Neural Computation, 6, 181-214. 

Kalli M, Griffin J, Walker S (2011). “Slice Sampling Mixture Models.” Statistics and 
Computing, 21, 93-105. 


33 


Karabatsos G (2014). “Fast Marginal Likelihood Estimation of the Ridge Parameter in 
Ridge Regression.” Technical Report 1409.2437, arXiv preprint. 

Karabatsos G, Talbott E, Walker S (2015). “A Bayesian Nonparametric Meta-Analysis 
Model.” Research Synthesis Methods, 6, 28-44. 

Karabatsos G, Walker S (2012a). “Adaptive-Modal Bayesian Nonparametric Regression.” 
Electronic Journal of Statistics, 6, 2038-2068. 

Karabatsos G, Walker S (2012b). “Bayesian Nonparametric Mixed Random Utility Mod¬ 
els.” Computational Statistics and Data Analysis, 56, 1714-1722. 

Karabatsos G, Walker S (2015). “Bayesian Nonparametric Item Response Models.” In 
W. van der Linden (ed.), Handbook Of Item Response Theory, Volume 1: Models, 
Statistical Tools, and Applications. Taylor and Francis, New York. 

Karabatsos G, Walker S (2015, to appear). “A Bayesian Nonparametric Gausal Model 
for Regression Discontinuity Designs.” In P Miiller, R Mitra (eds.), Nonparametric 
Bayesian Methods in Biostatistics and Bioinformatics. Springer-Verlag, New York. 
(ArXiv preprint 1311.4482). 

Kingman J (1975). “Random Discrete Distributions.” Journal of the Royal Statistical So¬ 
ciety, Series B, 37, 1-22. 

Klein J, Moeschberger M (2010). Survival Analysis (2nd Ed.). Springer-Verlag, New York. 

Laud P, Ibrahim J (1995). “Predictive Model Selection.” Journal of the Royal Statistical 
Society, Series B, 57, 247-262. 

Lee D (2008). “Randomized Experiments from Non-Random Selection in U.S. House Elec¬ 
tions.” Journal of Econometrics, 142, 675-697. 

Lee D, Lemieux T (2010). “Regression Discontinuity Designs in Economics.” The Journal 
of Economic Literature, 48, 281-355. 

Lijoi A, Mena R, Priinster I (2005). “Hierarchical Mixture Modeling with Normalized 
Inverse-Gaussian Priors.” Journal of the American Statistical Association, 100, 1278- 
1291. 

Lindley D, Smith A (1972). “Bayes Estimates for the Linear Model (with Discussion).” 
Journal of the Royal Statistical Society, Series B, 34, 1-41. 

Lo A (1984). “On a Glass of Bayesian Nonparametric Estimates.” Annals of Statistics, 12, 
351-357. 

Lunceford J, Davidian M (2004). “Strati... cation and Weighting Via the Propensity 
Score in Estimation of Gausal Treatment Effects: A Gomparative Study.” Statistics 
in Medicine, 23, 2937-2960. 


34 


MacEachern S (1999). “Dependent Nonparanietric Processes.” Proceedings of the Bayesian 
Statistical Sciences Section of the American Statistical Association, pp. 50-55. 

MacEachern S (2000). “Dependent Dirichlet Processes.” Technical report, Department of 
Statistics, The Ohio State University. 

MacEachern S (2001). “Decision Theoretic Aspects of Dependent Nonparametric Pro¬ 
cesses.” In E George (ed.), Bayesian Methods with Applications to Science, Policy and 
Official Statistics, pp. 551-560. International Society for Bayesian Analysis, Greta. 

McGullagh P, Nelder J (1989). Generalized Linear Models (2nd Ed.). Ghapman and Hall, 
London. 

McLachlan G, Peel D (2000). Finite Mixture Models. John Wiley and Sons, New York. 

Mitra R, Muller P (2015, to appear). Nonparametric Bayesian Methods in Biostatistics 
and Bioinformatics. Springer-Verlag. New York. 

Molenberghs G, Verbeke G (2005). Models for Discrete Longitudinal Data. Springer-Verlag, 
New York. 

Miiller P, Quintana F (2004). “Nonparametric Bayesian Data Analysis.” Statistical Science, 
19, 95-110. 

Miiller P, Rosner G, lorio MD, MacEachern S (2005). “A Nonparametric Bayesian Model 
for Inference in Related Studies.” Applied Statistics, 54, 611-626. 

Neal R (2003). “Slice Sampling (with Discussion).” Annals of Statistics, 31, 705-767. 

Nychka D (2000). “Spatial-Process Estimates as Smoothers.” In Smoothing and Regression: 
Approaches, Computation, and Application, pp. 393-424. John Wiley and Sons, New 
York. 

O’Hagan A, Forster J (2004). Kendall’s Advanced Theory of Statistics: Bayesian Inference, 
volume 2B. Arnold, London. 

Perman M, Pitman J, Yor M (1992). “Size-Biased Sampling of Poisson Point Processes 
and Excursions.” Probability Theory and Related Fields, 92, 21-39. 

Pitman J (1995). “Exchangeable and Partially Exchangeable Random Partitions.” Proba¬ 
bility Theory and Related Fields, 102, 145-158. 

Pitman J (1996). “Some Developments of the Blackwell-MacQueen Urn Scheme.” In T 
Ferguson, L Shapeley, J MacQueen (eds.). Statistics, Probability and Game Theory. 
Papers in Honor of David Blackwell, pp. 245-268. Institute of Mathematical Sciences, 
Hayward, GA. 

Pitman J, Yor M (1997). “The Two-Parameter Poisson-Dirichlet Distribution Derived from 
a Stable Subordinator.” Annals of Probability, 25, 855-900. 


35 


Plummer M, Best N, Cowles K, Vines K (2006). “CODA: Convergence Diagnosis and 
Output Analysis for MCMC.” R News, 6, 7-11. 

Prado R, West M (2010). Time Series: Modeling, Computation, and Inference. Chapman 
and Hall/CRC. 

Quintana F, Iglesias P (2003). “Bayesian Clustering and Product Partition Models.” Jour¬ 
nal of the Royal Statistical Society. Series B, 65, 557-574. 

R Development Core Team (2015). R: A Language and Environment for Statistical Com¬ 
puting. R Foundation for Statistical Computing, Vienna, Austria. 

Rasmussen C, Ghahramani Z (2002). “Infinite of Mixtures of Gaussian Process Experts.” 
In T Diettrich, S Becker, Z Ghahramani (eds.). Advances in Neural Information Pro¬ 
cessing Systems, volume If, pp. 881-888. The MIT Press, Cambridge, MA. 

Regazzini E, Lijoi A, Priinster I (2003). “Distributional Results for Means of Normalized 
Random Measures with Independent Increments.” Annals of Statistics, 31, 560-585. 

Robert C, Casella G (2004). Monte Carlo Statistical Methods (2nd Ed.). Springer, New 
York. 

Rosenbaum P, Rubin D (1983a). “The Central Role of the Propensity Score in Observa¬ 
tional Studies for Causal Effects.” Biometrika, 70, 41-55. 

Rosenbaum P, Rubin D (1984). “Reducing Bias in Observational Studies Using Subclassi- 
hcation on the Propensity Score.” Journal of the American Statistical Association, 79, 
516-524. 

Rosenthal R (1994). “Parametric Measures of Effect Size.” In H Cooper, L Hedges (eds.). 
The Handbook of Research Synthesis, pp. 231-244. Russell Sage Foundation, New 
York. 

Schafer J, Kang J (2008). “Average Causal Effects from Nonrandomized Studies: A Prac¬ 
tical Guide and Simulated Example.” Psychological Methods, 13, 279-313. 

Schervish M (1995). Theory of Statistics. Springer-Verlag, New York. 

Scott D (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. 
John Wiley and Sons, New York. 

Sethuraman J (1994). “A Constructive Dehnition of Dirichlet Priors.” Statistica Sinica, f, 
639-650. 

Sethuraman J, Tiwari R (1982). “Convergence of Dirichlet Measures and the Interpretation 
of Their Parameters.” In S Gupta, J Berger (eds.). Statistical Decision Theory and 
Related Topics III: Proceedings of the Third Purdue Symposium, Volume, pp. 305-315. 
Academic Press, New York. 


36 


Silverman B (1986). Density Estimation for Statistics and Data Analysis. Chapman and 
Hall, Boca Raton, Florida. 

Stroud J, Miiller P, Sanso B (2001). “Dynamic Models for Spatiotemporal Data.” Journal 
of the Royal Statistical Society, Series B, 63, 673-689. 

Stuart E (2010). “Matching Methods for Causal Inference: A Review and a Look Forward.” 
Statistical Science, 25, 1-21. 

Thistlewaite D, Campbell D (1960). “Regression-Discontinuity Analysis: An Alternative 
to the Ex-post Facto Experiment.” Journal of Educational Psychology, 51, 309-317. 

Thompson S, Sharp S (1999). “Explaining Heterogeneity in Meta-Analysis: A Comparison 
of Methods.” Statistics in Medicine, 18, 2693-2708. 

van der Linden W (2015). Handbook of Modern Item Response Theory (Second Edition). 
CTB McGraw-Hill, Monterey, CA. 

Verbeke G, Molenbergs G (2000). Linear Mixed Models for Longitudinal Data. Springer- 
Verlag, New York. 

Walker S, Damien P, Laud P, Smith A (1999). “Bayesian Nonparametric Inference for 
Random Distributions and Related Functions.” Journal of the Royal Statistical Society, 
Series B, 61, 485-527. 

Wilk M, Gnanadesikan R (1968). “Probability Plotting Methods for the Analysis of Data.” 
Biometrika, 55, 1-17. 

Wu TF, Lin CJ, Weng R (2004). “Probability Estimates for Multi-Class Classification by 
Pairwise Coupling.” Journal of Machine Learning Research, 5, 975-1005. 


37 


Appendix A: Technical Preliminaries 

We use the following notation for random variables and probability measures (functions) 
(e.g., Berger & Casella, 2002; Schervish, 1995). A random variable is denoted by an italicized 
capital letter, such as F, with V a function from a sample space ^ to a set of real numbers. 
A realization of that random variable is denoted by a lower case, with Y = y. Also, F{-) 
(or P{-)) is the probability measure (function) that satishes the Kolmogorov probability 
axioms, having sample space domain y and range (0,1). Then F{B) (or P{B)) denotes the 
probability of any event B G y oi interest. Throughout, a probability measure is denoted 
by a capital letter such as F (or G or If resp., for example), and f{y) {g{y) and 7r{y), resp.) 
is the corresponding probability density of y, if Y is discrete or continuous. 

If y is a continuous random variable, with sample space 3^ C (d G Z+) and Borel 
(T-held B{y), and with a probability density function (p.d.f.) / dehned on y such that 
/ f{il)dy = 1, then the probability measure of / is given by F{B) = /(|/)d|/ for all 

B G B{y), with f[y) = (\F{y)/(\y >0. If F is a discrete random variable with a countable 
sample space y = {|/i,|/ 2 ,...} and Borel cx-held B{y), and with probability mass function 
(p.m.f.) / dehned on y such that TAi fiVk) = 1; then the probability measure of / is 
given by F{B) = 'E{k:y,€B} fiVk) for all B G B{y), with 0 < f{y) = PiY = y) < 1. 
Also, a cumulative distribution function (c.d.f.) is the probability measure F{y) := F{B) = 
Fix F y)) where B = {y' ■. —oo < y' < y}. The c.d.f. Fiij) corresponds to a p.d.f. 
fiy) = dF{y)/dy if F is continuous; and corresponds to a p.m.f. /(?/) = P(F = y) iiY is 
discrete. A multidimensional random variable, Y = (Fi,..., F^), has c.d.f. P(|/i, ... ,yd) = 
P{yi < 2/1, • • • ,Ffc < yd). 

Y ^ F means that the random variable F has a distribution dehned by the probability 
measure F. Thus F is also called a distribution. Notation such a F ~ F{y \ x) (or F ~ 
F{y\x.), resp.) means that the random variable Y\x (the random variable, F | x, resp.) 
has a distribution dehned by a probability measure P(-1 x) conditionally on the value x of 
a variable X (or has a distribution P(-1 x) conditionally on the value x = (xi,..., Xp) of p 
variables). Sometimes a probability distribution (measure) is notated without the • | symbol. 

We denote N(-|yU, a^), $(■) = N(-|0,1), Ga(-|a,&), IG(-|a,5), U(-|a, 6), Be(-|a,6), 
GIG(-1 a, b, q), N(-1 S), IW(- | d, S), and Di(-1 a), as the probability measures of a normal 

distribution with mean and variance parameters the standard normal distribution; 

the gamma distribution with shape and rate parameters {a,b)] the inverse-gamma distri¬ 
bution with shape and rate parameters {a,b)] the uniform distribution with minimum and 
maximum parameters (a, 6); the beta distribution with shape parameters {a,b)] the inverse- 
Gaussian distribution with mean /i > 0 and shape A > 0; the generalized inverse-Gaussian 
distribution with parameters (a, b, q) G M+ x M+ x M; the multivariate normal distribution 
for a random vector (Fi,..., F^), with mean = (/x^,..., pX dx d variance-covariance 
matrix S; the inverted-Wishart distribution with degrees of freedom d and scale matrix 
S] and of the Dirichlet distribution with precision parameters a, respectively. These dis¬ 
tributions are continuous and dehne p.d.f.s denoted by lower-case letters, with n(- \p,X), 
n(- I 0,1), ga(- \a,b), ig(- \a,b), u(- \a,b), be(- \a,b), gig(-1 a, b, q), n(-1 S), iw(- | d. S'), and 

di(- 1 a), respectively. The p.d.f. equations are found in statistical distribution textbooks 
(e.g., Johnson et ah, 1994). 
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Appendix B: BNP Data Analysis Exercises 

You may use the Bayesian regression software to answer the following data analysis 
exercises. Relevant data sets are available from the software, and are described under the 
flelp menu. 

1. (Regression) Perform regression analyses of the PIRLSlOO.csv data from the 2006 
Progress in International Reading Literacy Stndy (PIRLS). Data are from a random 
sample of 100 students who each attended one of 21 U.S. low income schools. The de¬ 
pendent variable, zREAD, is a continnous-valned stndent literacy score. Also, consider 
eight covariates (predictors): a binary (0,1) indicator of male status (MALE), student 
age (AGE), size of stndent’s class (CLSIZE), percent of English langnage learners in 
the stndent’s classroom (ELL), years of experience of the stndent’s teacher (TEXP), 
years of edncation of the stndent’s teacher (EDLEVEL), the number of students en¬ 
rolled in the stndent’s school (ENROL), and a 3-point rating of the stndent’s school 
safety (SCHSAFE = 1 is highest; SCHSAFE = 3 is lowest). There is also data on a 
school ID (SCHID) gronping variable. 

Analyze the data set using the regression models listed below, after performing a z- 
score transformation of the 8 covariates using the menu option: Modify Data Set > 
Simple variable transformations > Z-score. The SCHID grouping variable may be in- 
clnded in a mnlti-level regression model. 

- ANOVA/linear DDP model (Dirichlet process (DP) mixtnre of linear regressions). 

- An inhnite-probits model. 

- Another Bayesian nonparametric (inhnite) mixtnre of regression models. 

(a) Now, describe (represent) each of the models, mathematically and in words. 

(b) Summarize the results of your data analyses, while focusing on the relevant re¬ 
sults. Describe the relationship between the dependent variable and each covariate. 
For the inhnite-probits model, describe the clustering behavior in the posterior predic¬ 
tive distribution (density) of the dependent variable, conditionally on values of one or 
two covariates of your choice. For the DDP model, graphically describe the clustering 
behavior of the random intercept and slope parameters. 

(c) Evaluate how well the data support the assumptions of other parametric mod¬ 
els, such as the linearity of regression, the normality of the regression errors, and the 
normality of the random intercept and slope coefficients. Fit a normal linear regres¬ 
sion and normal random-effects (mnlti-level HLM) models to the data, and compare 
the predictive ht between the linear model(s) and the Bayesian nonparametric models 
above. 

2. (Binary Regression) Analyze the data using the following binary regression models, 
with binary (0,1) dependent variable READPASS, and the 8 z-scored covariates. 

- ANOVA/linear DDP logit (or probit) model (a DP mixture of regressions). 

- An inhnite-probits model for binary regression. 

~ Another Bayesian nonparametric (inhnite) mixtnre of regression models. 

(a) Describe (represent) each of the models, mathematically and in words. 

(b) Snmmarize the results of your data analyses, while focusing on the relevant results. 
Describe the relationship between the dependent variable and each covariate. For the 


39 


infinite-probits model, describe the clustering behavior in the posterior predictive dis¬ 
tribution (density) of the (latent) dependent variable, conditionally on values of 1 or 
2 covariates of your choice. For the DDF model, graphically describe the clustering 
behavior of the random intercept and slope parameters. 

(c) Evaluate how well the data support assumptions of other parametric models, 
namely, the unimodality of the (latent) regression errors, and the normality of the 
random intercept and slope coefficients. This will require fitting a probit (or logit) 
linear and/or random-effects regression model to the data. Compare the fit between 
the probit (or logit) model(s) and the Bayesian nonparametric models listed above. 

3. (Causal Analysis) Analyze the PIRLSlOO.csv data using a Bayesian nonparamet¬ 
ric regression model. Investigate the causal effect of large (versus small) class size 
(CLSIZE) on reading performance (zREAD), in the context of a regression disconti¬ 
nuity design (e.g.. Cook, 2008). For the data set, use a Modify Data Set menu option 
to construct a new (treatment) variable defined by a (0,1) indicator of large class size, 
named LARGE, where LARGE = 1 if CLSIZE > 21, and zero otherwise. Perform 
a regression of zREAD on the predictors (LARGE, CLSIZE), in order to infer the 
causal effect of large class size (LARGE = 1) versus small class size (LARGE = 
0), on the variable zREAD, conditionally on CLSIZE = 21. Do so by inferring the 
coefficient estimates of the covariate LARGE, and by performing posterior predictive 
inferences of zREAD, conditionally on LARGE = 1 versus LARGE = 0. Under 
relatively mild conditions, such comparisons provide inferences of causal effects (of 
large versus small class size), as if the treatments were randomly assigned to subjects 
associated with class sizes in a small neighborhood around CLSIZE = 21. A key con¬ 
dition (assumption) is that students have imperfect control over the CLSIZE variable, 
around the value of CLSIZE = 21 (Lee, 2008; Lee & Lemieux, 2010). 

4. (3-level data) Analyze the Classroom data set (classroomSOO.csv) using a Bayesian 
nonparametric regression model. In the data, students (Level 1) are nested within 
classrooms (Level 2) nested within schools (Level 3). A classroom identiher (classid) 
provides a level-2 grouping variable. A school identiher (schoolid) provides a level-3 
grouping variable. Model mathgain as the dependent variable, dehned by student gain 
in mathematical knowledge. Also, for the model, include at least three covariates: 
student socioeconomic status (ses), the level of experience of the student’s teacher 
(yearsteaching), and student house poverty level (housepov). 

5. (Longitudinal Data Analysis) Fit a Bayesian nonparametric regression model to 
analyze the GPA.csv data. Investigate the relationship between gpa and the covariates 
of time, job type, gender status, and possibly student ID. Consider performing an 
auto-regressive time-series analysis of the data, using the model. For this, you can use 
a Modify Data Set menu option to construct lag-terms of the dependent variable (for 
selected lag order) and then include them as additional covariates in the regression 
model. 

6. (Meta Analysis) Fit a Bayesian nonparametric regression model to perform a meta¬ 
analysis of the Calendar.CSV data. The dependent variable is Effect size, here dehned 
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by the unbiased standardized difference (Hedges, 1981) between mean student achieve¬ 
ment in schools that follow a year-round calendar, and (minus) the mean student 
achievement in schools that follow the traditional nine-month calendar. The covariates 
are standard error of the effect size (SE_ES), and study publication Year; and the 
observation weights are given by the variable weight (= 1/Var). The overall effect- 
size, over studies, is represented by the model’s intercept parameter. In your report 
of the data analysis results, common on the impact of publication bias on estimates 
of the overall effect size. Publication bias may be assessed by inferring the slope co¬ 
efficient estimate of the SE_ES covariate; and/or by posterior predictive inferences of 
the Effect dependent variable, conditionally over a range of values of SE_ES. Such 
inferences provide regression analyses for a funnel plot (Thompson & Sharp, 1999). 

7. (Survival Analysis of censored data) Use a Bayesian nonparametric regression 
model to perform a survival analysis of the larynx.csv data or the bcdeter.csv data. 
The larynx data set has dependent variable logyears; and covariates of patient age, 
cancer stage, and diagnosis year (diagyr). The observations of logyears are ei¬ 
ther right-censored or uncensored, as indicated by the variables LB and UB. The 
bcdeter data set has dependent variable logmonths, and treatment type indicator 
covariates (radioth, radChemo). The logmonths observations are either right- 
censored, interval-censored, or uncensored, as indicated by the variables logLBplusl 
and logUBplusl. 

8. (Item Response Analysis) Use a BNP model to perform an item response (IRT) 
analysis of either the NAEP75.csv data set, or the Teacher49.csv data set. The 
NAEP75 data has binary item-response scores (0 = Incorrect, 1 = Correct). The 
Teacher49 data has ordinal item response scores (0, 1, or 2). Each data set has infor¬ 
mation about an examinee, per data row, with examinee item responses in multiple 
data columns. For either data set, use a ’’vectorize” Modify Data Set menu option, to 
collapse the multiple item response variables (columns) into a single column dependent 
variable, and to construct item dummy (0 or —1) covariates. Report the (marginal) 
posterior estimates of examinee ability, item difficulty (for each test item), as well as 
the estimates of the other model parameters. 
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